February 10 2014

Four short links: 10 February 2014

  1. Bruce Sterling at transmediale 2014 (YouTube) — “if it works, it’s already obsolete.” Sterling does a great job of capturing the current time: spies in your Internet, lost trust with the BigCos, the impermanence of status quo, the need to create. (via BoingBoing)
  2. No-one Should Fork Android (Ars Technica) — this article is bang on. Google Mobile Services (the Play functionality) is closed-source, what makes Android more than a bare-metal OS, and is where G is focusing its development. Google’s Android team treats openness like a bug and routes around it.
  3. Data Pipelines (Hakkalabs) — interesting overview of the data pipelines of Stripe, Tapad, Etsy, and Square.
  4. Visualising Salesforce Data in Minecraft — would almost make me look forward to using Salesforce. Almost.

December 16 2013

Four short links: 16 December 2013

  1. Suro (Github) — Netflix data pipeline service for large volumes of event data. (via Ben Lorica)
  2. NIPS Workshop on Data Driven Education — lots of research papers around machine learning, MOOC data, etc.
  3. Proofist — crowdsourced proofreading game.
  4. 3D-Printed Shoes (YouTube) — LeWeb talk from the founder of the company (Continuum Fashion). (via Brady Forrest)

December 12 2013

Four short links: 12 December 2013

  1. iBeacons — Bluetooth LE enabling tighter coupling of physical world with digital. I’m enamoured with the interaction possibilities: The latest Apple TV software brought a fantastically clever workaround. You just tap your iPhone to the Apple TV itself, and it passes your Wi-Fi and iTunes credentials over and sets everything up instantaneously.
  2. Better and Better Keyboards (Jesse Vincent) — It suffered from the same problem as every other 3D-printed keyboard I’d made to date – When I showed it to someone, they got really excited about the fact that I had a 3D printer. In contrast, whenever I showed someone one of the layered acrylic prototype keyboards I’d built, they got excited about the keyboard.
  3. — open source modular web service for dataset storage and retrieval.
  4. state.js — Open source JavaScript state machine supporting most UML 2 features.

December 04 2013

Four short links: 4 December 2013

  1. Skyjack — drone that takes over other drones. Welcome to the Malware of Things.
  2. Bootstrap World — a curricular module for students ages 12-16, which teaches algebraic and geometric concepts through computer programming. (via Esther Wojcicki)
  3. Harvest — open source BSD-licensed toolkit for building web applications for integrating, discovering, and reporting data. Designed for biomedical data first. (via Mozilla Science Lab)
  4. Project ILIAD — crowdsourced antibiotic discovery.

November 28 2013

November 06 2013

Four short links: 6 November 2013

  1. Apple Transparency Report (PDF) — contains a warrant canary, the statement “Apple has never received an order under Section 215 of the USA Patriot Act. We would expect to challenge an order if served on us,” which will of course be removed if one of the secret orders is received. Bravo, Apple, for implementing a clever hack to route around excessive secrecy. (via Boing Boing)
  2. You’re Probably Polluting Your Statistics More Than You Think — it is insanely easy to find phantom correlations in random data without obviously being foolish. Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this. (via Stijn Debrouwere)
  3. CyPhy Funded (Quartz) — the second act of iRobot co-founder Helen Greiner, maker of the famed Roomba robot vacuum cleaner. She terrified ETech long ago—the audience were expecting Roomba cuteness and got a keynote about military deathbots. It would appear she’s still in the deathbot niche, not so much with the cute. Remember this when you build your OpenCV-powered recoil-resistant load-bearing-hoverbot and think it’ll only ever be used for the intended purpose of launching fertiliser pellets into third world hemp farms.
  4. User-Agent String History — a light-hearted illustration of why the formal semantic value of free-text fields is driven to zero in the face of actual use.

November 05 2013

Four short links: 5 November 2013

  1. Influx DB — open-source, distributed, time series, events, and metrics database with no external dependencies.
  2. Omega (PDF) — flexible, scalable schedulers for large compute clusters. From Google Research.
  3. GraspJS — Search and replace your JavaScript code based on its structure rather than its text.
  4. Amazon Mines Its Data Trove To Bet on TV’s Next Hit (WSJ) — Amazon produced about 20 pages of data detailing, among other things, how much a pilot was viewed, how many users gave it a 5-star rating and how many shared it with friends.

October 28 2013

Four short links: 28 October 2013

  1. A Cyber Attack Against Israel Shut Down a Road — The hackers targeted the Tunnels’ camera system which put the roadway into an immediate lockdown mode, shutting it down for twenty minutes. The next day the attackers managed to break in for even longer during the heavy morning rush hour, shutting the entire system for eight hours. Because all that is digital melts into code, and code is an unsolved problem.
  2. Random Decision Forests (PDF) — “Due to the nature of the algorithm, most Random Decision Forest implementations provide an extraordinary amount of information about the final state of the classifier and how it derived from the training data.” (via Greg Borenstein)
  3. BITalino — 149 Euro microcontroller board full of physiological sensors: muscles, skin conductivity, light, acceleration, and heartbeat. A platform for healthcare hardware hacking?
  4. How to Be a Programmer — a braindump from a guru.

October 25 2013

Four short links: 25 October 2013

  1. Seagate Kinetic Storage — In the words of Geoff Arnold: The physical interconnect to the disk drive is now Ethernet. The interface is a simple key-value object oriented access scheme, implemented using Google Protocol Buffers. It supports key-based CRUD (create, read, update and delete); it also implements third-party transfers (“transfer the objects with keys X, Y and Z to the drive with IP address”). Configuration is based on DHCP, and everything can be authenticated and encrypted. The system supports a variety of key schemas to make it easy for various storage services to shard the data across multiple drives.
  2. Masters of Their Universe (Guardian) — well-written and fascinating story of the creation of the Elite game (one founder of which went on to make the Raspberry Pi). The classic action game of the early 1980s – Defender, Pac Man – was set in a perpetual present tense, a sort of arcade Eden in which there were always enemies to zap or gobble, but nothing ever changed apart from the score. By letting the player tool up with better guns, Bell and Braben were introducing a whole new dimension, the dimension of time.
  3. Micropolar (github) — A tiny polar charts library made with D3.js.
  4. Introduction to R (YouTube) — 21 short videos from Google.

October 24 2013

Four short links: 24 October 2013

  1. Visually Programming Arduino — good for little minds.
  2. Rapid Hardware Iteration at Scale (Forbes) — It’s part of the unique way that Xiaomi operates, closely analyzing the user feedback it gets on its smartphones and following the suggestions it likes for the next batch of 100,000 phones. It releases them every Tuesday at noon Beijing time.
  3. Machine Learning of Hierarchical Clustering to Segment 2D and 3D Images (PLoS One) — We propose an active learning approach for performing hierarchical agglomerative segmentation from superpixels. Our method combines multiple features at all scales of the agglomerative process, works for data with an arbitrary number of dimensions, and scales to very large datasets.
  4. Kratu — an Open Source client-side analysis framework to create simple yet powerful renditions of data. It allows you to dynamically adjust your view of the data to highlight issues, opportunities and correlations in the data.

October 22 2013

Four short links: 22 October 2013

  1. Sir Trevor — nice rich-text editing. Interesting how Markdown has become the way to store formatted text without storing HTML (and thus exposing the CSRF-inducing HTML-escaping stuckfastrophe).
  2. Slate for Excel — visualising spreadsheet structure. I’d be surprised if it took MSFT or Goog 30 days to acquire them.
  3. Project Shield — Google project to protect against DDoSes.
  4. Digital Attack Map — DDoS attacks going on around the world. (via Jim Stogdill)

August 10 2013

Four short links: 12 August 2013

  1. List of Malware pcaps and Samples — Currently, most of the samples described have the corresponding samples and pcaps available for download.
  2. InterTwinkles — open source platform built from the ground up to help small democratic groups to do process online. It provides structure to improve the efficiency of specific communication tasks like brainstorming and proposals. (via Willow Bl00)
  3. Lavabit, Privacy, Seppuku, and Game Theory (Vikram Kumar) — Mega’s CEO’s private blog, musing about rational responses to malstates.
  4. Telepath Keylogger (Github) — A happy Mac keylogger for Quantified Self purposes. (via Nick Winter)

August 01 2013

Four short links: 2 August 2013

  1. Unhappy Truckers and Other Algorithmic Problems — Even the insides of vans are subjected to a kind of routing algorithm; the next time you get a package, look for a three-letter letter code, like “RDL.” That means “rear door left,” and it is so the driver has to take as few steps as possible to locate the package. (via Sam Minnee)
  2. Fuel3D: A Sub-$1000 3D Scanner (Kickstarter) — a point-and-shoot 3D imaging system that captures extremely high resolution mesh and color information of objects. Fuel3D is the world’s first 3D scanner to combine pre-calibrated stereo cameras with photometric imaging to capture and process files in seconds.
  3. Corporate Open Source Anti-Patterns (YouTube) — Bryan Cantrill’s talk, slides here. (via Daniel Bachhuber)
  4. Hacking for Humanity (The Economist) — Getting PhDs and data specialists to donate their skills to charities is the idea behind the event’s organizer, DataKind UK, an offshoot of the American nonprofit group.

July 23 2013

Interactive map: bike movements in New York City and Washington, D.C.

From midnight to 7:30 A.M., New York is uncharacteristically quiet, its Citibikes–the city’s new shared bicycles–largely stationary and clustered in residential neighborhoods. Then things begin to move: commuters check out the bikes en masse in residential areas across Manhattan and, over the next two hours, relocate them to Midtown, the Flatiron district, SoHo, and Wall Street. There they remain concentrated, mostly used for local trips, until they start to move back outward around 5 P.M.

Washington, D.C.’s bike-share program exhibits a similar pattern, though, as you’d expect, the movement starts a little earlier in the morning. On my animated map, both cities look like they’re breathing–inhaling and then exhaling once over the course of 12 hours or so.

The map below shows availability at bike stations in New York City and the Washington, D.C. area across the course of the day. Solid blue dots represent completely-full bike stations; white dots indicate empty bike stations. Click on any station to see a graph of average availability over time. I’ve written a few thoughts on what this means about the program below the graphic.

We can see some interesting patterns in the bike share data here. First of all, use of bikes for commuting is evidently highest in the residential areas immediately adjacent to dense commercial areas. That makes sense; a bike commute from the East Village to Union Square is extremely easy, and that’s also the sort of trip that tends to be surprisingly difficult by subway. The more remote bike stations in Brooklyn and Arlington exhibit fairly flat availability profiles over the course of the day, suggesting that to the degree they’re used at all, it’s mostly for local trips.

A bit about the map: I built this by scraping the data feeds that underlie the New York and Washington real-time availability maps every 10 minutes and storing them in a database. (Here is New York’s feed; here is Washington’s.) I averaged availability by station in 10-minute increments over seven weekdays of collected data. The map uses JavaScript (mostly jQuery) to manipulate an SVG image–changing the opacity of bike-share stations to represent availability and rendering a graph every time a station is clicked. I used Python and MySQL for the back-end work of collecting the data, aggregating it, and publishing it to a JSON file that the front-end code downloads and parses.
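
For readers who want to try something similar, the aggregation step can be sketched roughly as follows. This is a minimal illustration and not the code behind the map: the sample tuple layout, the station name, and the choice to measure availability as bikes divided by docks are all assumptions made for the example, and an in-memory list stands in for the MySQL table.

```python
import json
from collections import defaultdict
from datetime import datetime

BUCKET_MINUTES = 10  # matches the 10-minute increments described above


def bucket_of(ts: datetime) -> int:
    """Index of the 10-minute slot within the day (0 to 143)."""
    return (ts.hour * 60 + ts.minute) // BUCKET_MINUTES


def average_availability(samples):
    """Average availability per station and time slot, weekdays only.

    `samples` is an iterable of (station_id, timestamp, bikes, docks);
    this layout is hypothetical, not the schema of the real feeds.
    Returns {station_id: {slot: mean fraction of docks holding a bike}}.
    """
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for station_id, ts, bikes, docks in samples:
        if ts.weekday() >= 5 or docks == 0:
            continue  # skip weekends and stations reporting no docks
        cell = sums[station_id][bucket_of(ts)]
        cell[0] += bikes / docks
        cell[1] += 1
    return {
        station: {slot: total / n for slot, (total, n) in slots.items()}
        for station, slots in sums.items()
    }


# Made-up readings for one hypothetical station.
samples = [
    ("station-a", datetime(2013, 7, 15, 8, 3), 10, 20),  # Monday
    ("station-a", datetime(2013, 7, 16, 8, 7), 20, 20),  # Tuesday
    ("station-a", datetime(2013, 7, 20, 8, 5), 0, 20),   # Saturday, ignored
]
profile = average_availability(samples)

# The front end would download and parse a file with this content.
print(json.dumps(profile, indent=2))
```

Keying each slot by its position in the day keeps the published file small, since the front end only needs one number per station per slot to set a dot's opacity.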

This map, by the way, is an extremely simple example of what’s possible when the physical world is instrumented and programmable. I’ve written about sensor-laden machinery in my research report on the industrial internet, and we plan to continue our work on the programmable world in the coming months.

June 11 2013

Four short links: 11 June 2013

  1. For Example — amazing discussion of 3D visualization techniques, full of examples using the D3.js library and example gist system. Gorgeous and informative.
  2. Anti-Gravity 3D Printer — uses strands to sculpt on any surface. (via Slashdot)
  3. How 3D Printing Will Rebuild Reality (BoingBoing) — But even though home 3D-printing has received substantial publicity of late, it is in the industrial sector where the technology will probably make its most significant near-term impact on the world both by manufacturing improved commercial products and by stimulating industry to develop next-generation fab methods and machines that could one day truly bring 3D-printing home to users in a real way.
  4. The Emotional Side of Big Data — Personal Democracy Forum 2013 talk by Sara Critchfield, on reframing emotion as data for decision-making. (via Quartz)

June 03 2013

Big data vs. big reality

This post originally appeared on Cumulus Partners. It’s republished with permission.

Quentin Hardy’s recent post in the Bits blog of The New York Times touched on the gap between representation and reality that is a core element of practically every human enterprise. His post is titled “Why Big Data is Not Truth,” and I recommend it for anyone who feels like joining the phony argument over whether “big data” represents reality better than traditional data.

In a nutshell, this “us” versus “them” approach is like trying to pick a fight between oil painters and water colorists. Neither oil painting nor water colors are “truth”; both are forms of representation. And here’s the important part: Representation is exactly that — a representation or interpretation of someone’s perceived reality. Pitting “big data” against traditional data is like asking you if Rembrandt is more “real” than Gainsborough. Both of them are artists and both painted representations of the world they perceived around them.

The problem with false arguments like the one posed by Hardy is that they obscure the value of data — traditional data and big data — and the impact of data on our culture. I’m now working my way through “Raw Data” is an Oxymoron, an anthology of short essays about data. I recommend it for anyone who is seriously interested in thinking about the many ways in which data has influenced (and continues influencing) our lives. I especially recommend “facts and FACTS: Abolitionists’ Database Innovations,” by Ellen Gruber Garvey. As its title suggests, the essay focuses on what proves to be an absolutely fascinating period of U.S. history in which the anti-slavery movement harvested data from real advertisements in Southern newspapers to paint a vivid and believable picture of the routine horrors inflicted by the slave system on real human beings.

That 19th century use of data mining built support for the anti-slavery movement, both in the U.S. and in England. The data played a key role in making the case for abolishing slavery — even though it required the bloodiest war in U.S. history to make abolition a fact.

Data itself has no quality. It’s what you do with it that counts.

May 28 2013

Four short links: 28 May 2013

  1. My Little Geek — children’s primer with a geeky bent. A is for Android, B is for Binary, C is for Caffeine …. They have a Kickstarter for two sequels: numbers and shapes.
  2. Visible CSS Rules — Enter a URL to see how the CSS rules interact with that page.
  3. How to Work Remotely — none of this is rocket science, it’s all true and things we had to learn the hard way.
  4. Raspberry Pi Twitter Sentiment Server — step-by-step guide, and github repo for the lazy. (via Jason Bell)

May 22 2013

Four short links: 22 May 2013

  1. XBox One Kinect Controller (Guardian) — the new Kinect controller can detect gaze, heartbeat, and the buttons on your shirt.
  2. Surveillance and the Internet of Things (Bruce Schneier) — Lots has been written about the “Internet of Things” and how it will change society for the better. It’s true that it will make a lot of wonderful things possible, but the “Internet of Things” will also allow for an even greater amount of surveillance than there is today. The Internet of Things gives the governments and corporations that follow our every move something they don’t yet have: eyes and ears.
  3. Daniel Dennett’s Intuition Pumps (extract) — How to compose a successful critical commentary: 1. Attempt to re-express your target’s position so clearly, vividly and fairly that your target says: “Thanks, I wish I’d thought of putting it that way.” 2. List any points of agreement (especially if they are not matters of general or widespread agreement). 3. Mention anything you have learned from your target. 4. Only then are you permitted to say so much as a word of rebuttal or criticism.
  4. New Data Science Toolkit Out (Pete Warden) — with population data to let you compensate for population in your heatmaps. No more “gosh, EVERYTHING is more prevalent where there are lots of people!” meaningless charts.

May 14 2013

Four short links: 14 May 2013

  1. Behind the Banner — visualization of what happens in the 150ms when the cabal of data vultures decide which ad to show you. They pass around your data as enthusiastically as a pipe at a Grateful Dead concert, and you’ve just as much chance of getting it back. (via John Battelle)
  2. pwnpad — Nexus 7 with Android and Ubuntu, high-gain USB bluetooth, ethernet adapter, and a gorgeous suite of security tools. (via Kyle Young)
  3. Terra — a simple, statically-typed, compiled language with manual memory management [...] designed from the beginning to interoperate with Lua. Terra functions are first-class Lua values created using the terra keyword. When needed they are JIT-compiled to machine code. (via Hacker News)
  4. Metaphor Identification in Large Texts Corpora (PLOSone) — The paper presents the most comprehensive study of metaphor identification in terms of scope of metaphorical phrases and annotated corpora size. Algorithms’ performance in identifying linguistic phrases as metaphorical or literal has been compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus.

May 06 2013

Another Serving of Data Skepticism

I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!

That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.

Let’s do some thought experiments–unfortunately, totally devoid of data. But I don’t think we need data to get to the core of the problem. Think of the classic false correlation (when teaching logic, also used as an example of a false syllogism): there’s a strong correlation between people who eat pickles and people who die. Well, yeah. We laugh. But let’s take this a step further: correlation is a double headed arrow. So not only does this poor logic imply that we can reduce the death rate by preventing people from eating pickles, it also implies that we can harm the chemical companies that produce vinegar by preventing people from dying. And here we see what’s really happening: to remove one head of the double-headed arrow, we use “common sense” to choose between two stories: one that’s merely silly, and another that’s so ludicrous we never even think about it. Seems to work here (for a very limited value of “work”); but if I’ve learned one thing, it’s that good old common sense is frequently neither common nor sensible. For more realistic correlations, it certainly seems ironic that we’re doing all this data analysis just to end up relying on common sense.

Now let’s look at something equally hypothetical that isn’t silly. A drug is correlated with reduced risk of death due to heart failure. Good thing, right? Yes–but why? What if the drug has nothing to do with heart failure, but is really an anti-depressant that makes you feel better about yourself so you exercise more? If you’re in the “correlation is as good as causation” club, doesn’t make a difference: you win either way. Except that, if the key is really exercise, there might be much better ways to achieve the same result. Certainly much cheaper, since the drug industry will no doubt price the pills at $100 each. (Tangent: I once saw a truck drive up to an orthopedist’s office and deliver Vioxx samples with a street value probably in the millions…) It’s possible, given some really interesting work being done on the placebo effect, that a properly administered sugar pill will make the patient feel better and exercise, yielding the same result. (Though it’s possible that sugar pills only work as placebos if they’re expensive.) I think we’d like to know, rather than just saying that correlation is just as good as causation, if you have a lot of data.

Perhaps I haven’t gone far enough: with enough data, and enough dimensions to the data, it would be possible to detect the correlations between the drug, psychological state, exercise, and heart disease. But that’s not the point. First, if correlation really is as good as causation, why bother? Second, to analyze data, you have to collect it. And before you collect it, you have to decide what to collect. Data is socially constructed (I promise, this will be the subject of another post), and the data you don’t decide to collect doesn’t exist. Decisions about what data to collect are almost always driven by the stories we want to tell. You can have petabytes of data, but if it isn’t the right data, if it’s data that’s been biased by preconceived notions of what’s important, you’re going to be misled. Indeed, any researcher knows that huge data sets tend to create spurious correlations.
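
The point about spurious correlations is easy to check for yourself. Here is a minimal sketch, assuming nothing more than columns of independent Gaussian noise; the function names and the column and row counts are illustrative choices, not anything taken from a real data set. It reports the strongest pairwise correlation found purely by chance, first among a handful of columns and then among a couple of hundred.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_columns, n_rows, seed=0):
    """Largest |r| between any two columns of pure, unrelated noise."""
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_columns)
    ]
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))


# Same number of observations, very different numbers of variables.
few = strongest_chance_correlation(n_columns=5, n_rows=30)
many = strongest_chance_correlation(n_columns=200, n_rows=30)

print(f"strongest correlation among   5 noise columns: {few:.2f}")
print(f"strongest correlation among 200 noise columns: {many:.2f}")
```

Nothing in these columns is related to anything else, yet with 200 of them there are 19,900 pairs to compare, and the best-looking pair will look like a finding. Collecting more dimensions gives chance more places to hide.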

Causation has its own problems, not the least of which is that it’s impossible to prove. Unfortunately, that’s the way the world works. But thinking about cause and how events relate to each other helps us to be more critical about the correlations we discover. As humans we’re storytellers, and an important part of data work is building a story around the data. Mere correlations arising from a gigantic pool of data aren’t enough to satisfy us. But there are good stories and bad ones, and just as it’s possible to be careful in designing your experiments, it’s possible to be careful and ethical in the stories you tell with your data. Those stories may be the closest we ever get to an understanding of cause; but we have to realize that they’re just stories, that they’re provisional, and that better evidence (which may just be correlations) may force us to retell our stories at any moment. “Correlation is as good as causation” is just an excuse for intellectual sloppiness; it’s an excuse to replace thought with an odd kind of “common sense,” and to shut down the discussion that leads to good stories and understanding.
