
April 12 2013

Four short links: 12 April 2013

  1. Wikileaks ProjectK Code (Github) — open-sourced map and graph modules behind the Wikileaks code serving Kissinger-era cables. (via Journalism++)
  2. Plan Your Digital Afterlife With Inactive Account Manager — you can choose to have your data deleted — after three, six, nine or 12 months of inactivity. Or you can select trusted contacts to receive data from some or all of the following services: +1s; Blogger; Contacts and Circles; Drive; Gmail; Google+ Profiles, Pages and Streams; Picasa Web Albums; Google Voice and YouTube. Before our systems take any action, we’ll first warn you by sending a text message to your cellphone and email to the secondary address you’ve provided. (via Chris Heathcote)
  3. Leo Caillard: Art Games — Caillard’s images show museum patrons interacting with priceless paintings the way someone might browse through slides in a personal iTunes library on a device like an iPhone or MacBook. Playful and thought-provoking. (via Beta Knowledge)
  4. Lanyrd Pro — helping companies keep track of which events their engineers speak at, so they can avoid duplication and have maximum opportunity to promote those talks. First paid product from the startup of ETech and Foo Camp alum Simon Willison.

April 11 2013

Data skepticism

A couple of months ago, I wrote that “big data” is heading toward the trough of the hype curve as a result of oversized hype and promises. That’s certainly happening: I see more expressions of skepticism about the value of data every day. Some of the skepticism is a reaction against the hype; a lot of it arises from ignorance, and it has the same smell as the rich history of science denial, from the tobacco industry (and probably much earlier) onward.

But there’s another thread of data skepticism that’s profoundly important. On her MathBabe blog, Cathy O’Neil has written several articles about lying with data — about intentionally developing models that don’t work because it’s possible to make more money from a bad model than a good one. (If you remember Mel Brooks’ classic “The Producers,” it’s the same idea.) In a slightly different vein, Cathy argues that making machine learning simple for non-experts might not be in our best interests; it’s easy to start believing answers because the computer told you so, without understanding why those answers might not correspond with reality.

I had a similar conversation with David Reiley, an economist at Google, who is working on experimental design in social sciences. Heavily paraphrasing our conversation, he said that it was all too easy to think you have plenty of data, when in fact you have the wrong data, data that’s filled with biases that lead to misleading conclusions. As Reiley points out (pdf), “the population of people who sees a particular ad may be very different from the population who does not see an ad”; yet, many data-driven studies of advertising effectiveness don’t take this bias into account. The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.
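
Reiley’s point about biased samples is easy to make concrete with a toy simulation (a minimal sketch; the numbers and variable names below are invented purely for illustration): if the users who see an ad already differ from the users who don’t, a naive comparison of the two groups overstates the ad’s effect.

```python
import numpy as np

# Toy model: users have a latent "purchase intent"; the ad system is more
# likely to show the ad to high-intent users. All numbers are invented.
rng = np.random.default_rng(0)
n = 100_000
intent = rng.normal(0, 1, n)            # latent purchase intent
p_shown = 1 / (1 + np.exp(-intent))      # higher intent -> more likely to see the ad
shown = rng.random(n) < p_shown

# True causal effect of the ad: a small, constant lift.
true_lift = 0.01
p_buy = 0.05 + 0.04 * (intent > 1) + true_lift * shown
bought = rng.random(n) < p_buy

naive_lift = bought[shown].mean() - bought[~shown].mean()
print(f"true lift:  {true_lift:.3f}")
print(f"naive lift: {naive_lift:.3f}")   # inflated: the 'shown' group differed to begin with
```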

Skepticism about data is normal, and it’s a good thing. If I had to give a one-line definition of science, it might be something like “organized and methodical skepticism based on evidence.” So, if we really want to do data science, it has to be done by incorporating skepticism. And here’s the key: data scientists have to own that skepticism. Data scientists have to be the biggest skeptics. Data scientists have to be skeptical about models, they have to be skeptical about overfitting, and they have to be skeptical about whether we’re asking the right questions. They have to be skeptical about how data is collected, whether that data is unbiased, and whether that data — even if there’s an inconceivably large amount of it — is sufficient to give you a meaningful result.
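
Overfitting, one of the failure modes named above, is just as easy to demonstrate. A minimal sketch (the data and model choices are invented for illustration): as the model gets more flexible, error on the training points keeps shrinking, while error on held-out data stops improving.

```python
import numpy as np

# Fit polynomials of increasing degree to a handful of noisy points and
# evaluate on held-out data drawn from the same underlying curve.
rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 12)
x_test = np.linspace(0, 1, 200)
truth = lambda x: np.sin(2 * np.pi * x)
y_train = truth(x_train) + rng.normal(0, 0.2, x_train.size)
y_test = truth(x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```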

Because the bottom line is: if we’re not skeptical about how we use and analyze data, who will be? That’s not a pretty thought.

Predictive analytics and data sharing raise civil liberties concerns

Last winter, around the same time there was a huge row in Congress over the Cyber Intelligence Sharing and Protection Act (CISPA), U.S. Attorney General Holder quietly signed off on expanded rules on government data sharing. The rules allowed the National Counterterrorism Center (NCTC), housed within the Office of the Director of National Intelligence, to analyze the regulatory data collected during the business of government for patterns relevant to domestic terrorist threats.

Julia Angwin, who reported the story for the Wall Street Journal, highlighted the key tension: the rules allow the NCTC to “examine the government files of U.S. citizens for possible criminal behavior, even if there is no reason to suspect them.” 

On the one hand, this is a natural application of big data: search existing government records collected about citizens for suspicious patterns of behavior. The action can be justified for counter-terrorism purposes: there are advanced persistent threats. (When national security is invoked, privacy concerns are often deprecated.) The failure to "connect the dots" using existing data across government on Christmas Day 2009 (remember the so-called "underwear bomber"?) added impetus to getting more data in the NCTC’s hands. It’s possible that the rules on data retention were extended to five years because the agency didn’t have the capabilities it needed. Data mining existing records offers unprecedented opportunities to find and detect terrorism plots before they happen.

On the other hand, the changes at the NCTC that were authorized back in March 2012 represent a massive data grab with far-reaching consequences. The changes received little public discussion prior to the WSJ breaking the story, and they seem to substantially override the purpose of the Federal Privacy Act that Congress passed in 1974. Extension of the rules happened without public debate because of what effectively amounts to a legal loophole: post the proposed changes to the Federal Register, and voila. Effectively, this looks like an end run around the Federal Privacy Act.

Here’s the rub: according to Angwin, DoJ Chief Privacy Officer Nancy Libin:

“… raised concerns about whether the guidelines could unfairly target innocent people, these people said. Some research suggests that, statistically speaking, there are too few terror attacks for predictive patterns to emerge. The risk, then, is that innocent behavior gets misunderstood — say, a man buying chemicals (for a child’s science fair) and a timer (for the sprinkler) sets off false alarms. An August government report indicates that, as of last year, NCTC wasn’t doing predictive pattern-matching.”

It’s hard to say whether predictive data analytics are now in use at NCTC. It would be surprising if there weren’t pressure to experiment, given the expansion of “predictive policing” in cities around the U.S. There stand to be significant, long-lasting repercussions if the center builds the capacity to apply that capability at large scale without great care and informed Congressional oversight.
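
The statistical concern in the passage Angwin quotes is a base-rate problem: when the behavior being screened for is vanishingly rare, even a very accurate classifier produces mostly false alarms. A back-of-the-envelope sketch, with all numbers invented purely for illustration:

```python
# Bayes' rule with illustrative numbers: even a startlingly accurate detector
# yields almost entirely false positives when the base rate is tiny.
population = 300_000_000      # rough U.S. population
true_threats = 3_000          # invented base rate: one in 100,000 people
sensitivity = 0.99            # detector flags 99% of real threats
false_positive_rate = 0.001   # detector wrongly flags 0.1% of innocent people

flagged_threats = sensitivity * true_threats
flagged_innocent = false_positive_rate * (population - true_threats)

precision = flagged_threats / (flagged_threats + flagged_innocent)
print(f"people flagged:            {flagged_threats + flagged_innocent:,.0f}")
print(f"chance a flag is a threat: {precision:.2%}")   # ~1%: roughly 99 false alarms per real hit
```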

One outcome is a dystopian scenario straight out of science fiction, from “thoughtcrime” to presumptions of guilt. Alistair Croll highlighted some of the associated issues involved with big data and civil rights last year.

As Angwin pointed out, the likelihood of a terrorist attack in the U.S. remains low compared to other risks Americans face every day, from traffic to bees to lifestyle decisions. After 9/11, however, public officials and Congress have had little risk tolerance. As a result, a vast, expensive intelligence and surveillance infrastructure has been built up in the U.S., with limited oversight and very little accountability, as documented in “Top Secret America.”

When intelligence officials have gone public to the press as whistle-blowers regarding overspending, they have been prosecuted. Former National Security Agency staffer Thomas Drake spoke at the 2011 Web 2.0 Summit about his experience, and we talked about it in a subsequent interview.

The new rules have been in place now for months, with little public comment upon the changes. (Even after it was relaunched, the nation doesn’t seem to be reading the Federal Register. These days, I’m not sure how many members of the DC media do, either.) I’m unsure whether it’s fair to blame the press, though I do wonder how media resources were allocated during the “horse race” of the presidential campaigns last year. Now, the public is left to hope that the government oversees itself effectively behind closed doors.

I would find a recent “commitment to privacy and civil liberties” by the Department of Homeland Security more convincing if the agency weren’t confiscating and searching electronic devices at the border without a warrant.

Does anyone think that the privacy officers whose objections were overruled in the internal debates will provide the effective counterweight that protecting the Bill of Rights will require in the years to come?

Four short links: 11 April 2013

  1. A General Technique for Automating NES Games — software that learns how to play NES games and plays them automatically, using an aesthetically pleasing technique. With video, research paper, and code.
  2. rietveld — open source tool like Mondrian, Google’s code review tool. Written by Guido van Rossum, who also created Mondrian. Still under active development. (via Nelson Minar)
  3. KPI Dashboard for Early-Stage SaaS Startups — as Google Docs sheet. Nice.
  4. Life Without Sleep — interesting critique of Provigil as performance-enhancing drug for information workers. It is very difficult to design a stimulant that offers focus without tunnelling – that is, without losing the ability to relate well to one’s wider environment and therefore make socially nuanced decisions. Irritability and impatience grate on team dynamics and social skills, but such nuances are usually missed in drug studies, where they are usually treated as unreliable self-reported data. These problems were largely ignored in the early enthusiasm for drug-based ways to reduce sleep. [...] Volunteers on the stimulant modafinil omitted these feedback requests, instead providing brusque, non-question instructions, such as: ‘Exit West at the roundabout, then turn left at the park.’ Their dialogues were shorter and they produced less accurate maps than control volunteers. What is more, modafinil causes an overestimation of one’s own performance: those individuals on modafinil not only performed worse, but were less likely to notice that they did. (via Dave Pell)

April 10 2013

Four short links: 10 April 2013

  1. HyperLapse — this won the Internet for April. Everyone else can go home. Check out this unbelievable video; the source is available.
  2. Housing Simulator — NZ’s largest city is consulting on its growth plan, and includes a simulator so you can decide where the growth to house the hundreds of thousands of predicted residents will come from. Reminds me of NPR’s Budget Hero. Notice that none of the levers control immigration or city taxes to make different cities attractive or unattractive. Growth is a given and you’re left trying to figure out which green fields to pave.
  3. Converting To and From Google Map Tile Coordinates in PostGIS (Pete Warden) — Google Maps’ system of power-of-two tiles has become a de facto standard, widely used by all sorts of web mapping software. I’ve found it handy to use as a caching scheme for our data, but the PostGIS calls to use it were getting pretty messy, so I wrapped them up in a few functions. Code on github. (A sketch of the underlying tile math follows this list.)
  4. So You Want to Build A Connected Sensor Device? (Google Doc) — The purpose of this document is to provide an overview of infrastructure, options, and tradeoffs for the parts of the data ecosystem that deal with generating, storing, transmitting, and sharing data. In addition to providing an overview, the goal is to learn what the pain points are, so we can address them. This is a collaborative document drafted for the purpose of discussion and contribution at Sensored Meetup #10. (via Rachel Kalmar)
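
For readers curious about the tile scheme Warden is wrapping (item 3), the underlying conversion is the widely documented Web Mercator tiling math. Here is a rough Python sketch of that math (not Warden’s PostGIS functions):

```python
import math

def lonlat_to_tile(lon_deg: float, lat_deg: float, zoom: int) -> tuple[int, int]:
    """Convert longitude/latitude to Google/OSM-style tile x, y at a zoom level."""
    lat_rad = math.radians(lat_deg)
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def tile_to_lonlat(x: int, y: int, zoom: int) -> tuple[float, float]:
    """Return the longitude/latitude of the tile's north-west corner."""
    n = 2 ** zoom
    lon_deg = x / n * 360.0 - 180.0
    lat_rad = math.atan(math.sinh(math.pi * (1 - 2 * y / n)))
    return lon_deg, math.degrees(lat_rad)

print(lonlat_to_tile(-122.4, 37.8, 12))   # a zoom-12 tile covering San Francisco
```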

April 09 2013

The re-emergence of time-series

My first job after leaving academia was as a quant (1) for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability & statistics, econometrics, and optimization, with occasional forays into machine-learning (clustering, classification, anomalies). More recently, I’ve been closely following the emergence of tools that target large time series and decided to highlight a few interesting bits.

Time-series and big data

Over the last six months I’ve been encountering more data scientists (outside of finance) who work with massive amounts of time-series data. The rise of unstructured data has been widely reported, the growing importance of time-series much less so. Sources include data from consumer devices (gesture recognition & user interface design), sensors (apps for “self-tracking”), machines (systems in data centers), and health care. In fact some research hospitals have troves of EEG and ECG readings that translate to time-series data collections with billions (even trillions) of points.

Search and machine-learning at scale

Before doing anything else, one has to be able to run queries at scale. Last year I wrote about a team of researchers at UC Riverside who took an existing search algorithm, dynamic time-warping (2), and got it to scale to time-series with trillions of points. There are many potential applications of their research; one I highlighted is from health care:

… a doctor who needs to search through EEG data (with hundreds of billions of points), for a “prototypical epileptic spike”, where the input query is a time-series snippet with thousands of points.

As the size of data grows, the UCR dynamic time-warping algorithm takes longer to finish (a few hours for time-series with trillions of points). In general, (academic) researchers who’ve spent weeks or months collecting data are fine waiting a few hours for a pattern recognition algorithm to finish. But users who come from different backgrounds (e.g. web companies) may not be as patient. Fortunately “search” is an active research area and faster (distributed) pattern recognition systems will likely emerge soon.

Once you scale up search, other interesting problems can be tackled. The UCR team is using their dynamic time-warping algorithm in tasks like classification, clustering, and motif (3) discovery. Other teams are investigating techniques from signal-processing, pattern recognition, and trajectory tracking.
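
For readers who haven’t met dynamic time-warping before, here is a minimal, unoptimized sketch of the classic DTW recurrence. The UCR work layers aggressive lower bounds and pruning on top of this basic idea, so this toy version only shows what is being computed, not how they make it fast:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic time-warping distance between 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(np.sqrt(cost[n, m]))

query = np.sin(np.linspace(0, 2 * np.pi, 50))
candidate = np.sin(np.linspace(0, 2 * np.pi, 60) + 0.3)   # similar shape, shifted and resampled
print(dtw_distance(query, candidate))
```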

Some data management tools that target time-series

One of the more popular sessions at last year’s HBase Conference was on OpenTSDB, a distributed time-series database built on top of HBase. It’s used to store and serve time-series metrics, and comes with tools (based on GNUPlot) for charting. Originally named OpenTSDB2, KairosDB was written primarily for Cassandra (but also works with HBase). OpenTSDB emphasizes tools for readying data for charts (interpolating to fill in missing values), while KairosDB distinguishes between data and the presentation of data.
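
To give a sense of what feeding OpenTSDB looks like, here is a minimal sketch of writing one data point over its line-based ("telnet-style") interface, which accepts put commands of the form "put metric timestamp value tag=value" on port 4242 by default; the host name, metric, and tags below are invented for illustration:

```python
import socket
import time

# Minimal sketch: one data point sent to OpenTSDB's line-based write interface
# ("put <metric> <unix timestamp> <value> <tag>=<value> ..."), default port 4242.
# The host, metric, and tag names here are made up for illustration.
line = f"put sys.cpu.user {int(time.time())} 42.5 host=web01 dc=lab\n"

with socket.create_connection(("tsdb.example.com", 4242), timeout=5) as sock:
    sock.sendall(line.encode("ascii"))
```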

Startup TempoDB offers a reasonably priced, cloud-based service for storing, retrieving, and visualizing time-series data. Still a work in progress, SciDB is an open source database project designed specifically for data-intensive science problems. The designers of the system plan to make time-series analysis easy to express within SciDB.


(1) I worked on trading strategies for derivatives, portfolio & risk management, and option pricing.

(2) From my earlier post: In a recent paper, the UCR team noted that “… after an exhaustive literature search of more than 800 papers, we are not aware of any distance measure that has been shown to outperform DTW by a statistically significant amount on reproducible experiments”.

(3) Motifs are similar subsequences of a long time series; shapelets are time series primitives that can be used to speed up automatic classification (by reducing the number of “features”).

This post was originally published on strata.oreilly.com.

Privacy vs. speech

A week or so ago this link made its way through my tweet stream: “Privacy and the right to be forgotten.” Honestly I didn’t really even read it. I just retweeted it with a +1 or some other sign of approval because the notion that my flippant throwaway comments on the interwebs would be searchable forever has always left me a bit unsettled. Many times I’ve thought “Thank God the Internet wasn’t around when I was 20, because the things I would have said then online would have been orders of magnitude stupider than the stupidest things I say now.” I haven’t gotten any smarter, but I am a little bit better at filtering, and I rarely drink these days.

But today I read this piece from Stanford Law Review on the subject. And it’s smart. As is this simpler summary on NPR.

In so many domains the Internet creates these dichotomous tensions. There are two things we want and the Internet enables either, or neither, but not both.

I personally don’t think we need this kind of law. However, eventually it will become obvious that the cost of storing every damned thing I’ve ever uttered online exceeds any conceivable or achievable ROI from mining it. Hopefully, as companies realize this, they’ll offer a “feature” to solve this problem by letting me, and people like me, establish preferences for time to live and/or time to keep. For example, I’d be perfectly happy if Twitter enabled a one week time to live on every tweet I posted. They are meant to be ephemeral and it would be more than fine with me if their lifespans matched the level of thought I put into them.
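
No service offers such a feature today, so purely as a sketch of the wish: a time-to-live preference amounts to nothing more than a retention check applied to each post. Every name below is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention policy: drop any post older than its time-to-live.
ONE_WEEK = timedelta(days=7)

def expired(posted_at: datetime, ttl: timedelta = ONE_WEEK) -> bool:
    return datetime.now(timezone.utc) - posted_at > ttl

posts = [
    {"text": "flippant throwaway comment", "posted_at": datetime(2013, 4, 1, tzinfo=timezone.utc)},
    {"text": "slightly more considered post", "posted_at": datetime.now(timezone.utc)},
]
kept = [p for p in posts if not expired(p["posted_at"])]
print([p["text"] for p in kept])
```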

April 08 2013

An IPO by any other name

When Tableau goes public this summer, its shares will trade on NASDAQ under the apt ticker symbol “DATA.” Tickers are arguably less important now than they’ve ever been, since computers have removed much of the ambiguity they’re meant to resolve, but an interesting ticker symbol always stirs my fascination with corporate archaeology.

Some executives see prestige in quirky ticker symbols, or those that are just one letter long, and the exchanges take advantage of that to attract listings. It was rumored, for instance, that the New York Stock Exchange was reserving “M” and “I” for Microsoft and Intel, respectively, should they ever decide to ditch NASDAQ, and it finally gave up M in 2007 when Federated Department Stores re-listed itself as Macy’s. Since then, the exchanges have fallen over themselves to list tech companies, and NASDAQ and the New York Stock Exchange have handed over “Z” to Zillow and “P” to Pandora.

But the list of single-letter ticker symbols is strikingly Ozymandian. Two letters ahead of Zillow, and only about 43% larger by valuation, is United States Steel, the biggest corporation in the world at its founding by J.P. Morgan in 1901, but marginal enough 90 years later that it was removed from the Dow Jones Industrial Average.

At “Y” on the New York Stock Exchange is Alleghany Corp., today a smallish insurance firm, but at one time a swashbuckling holding company that controlled almost a fifth of the U.S. railway network, including the Chesapeake & Ohio and the New York Central, one of the country’s biggest, which it won in a dramatic proxy battle. It was a proto-conglomerate in the eventual mold of ITT and Gulf+Western, and in the middle of the 20th century it did as many railroad companies did, and got the hell out of the railroad business. Its ancillary investments, after a few acquisitions and divestments, added up to an insurance business. (When, in 1970, Penn Central Transportation filed for the biggest bankruptcy to date, its parent company similarly survived and held an insurance business alongside some of the railroad’s real estate. Its successor company, American Premier Underwriters, continued to own New York’s Grand Central Terminal and the air rights above its tracks, leasing them to New York’s Metropolitan Transportation Authority, until 2006 when it sold them to a group of real-estate investors.)

I thought of closing this post by clarifying that I don’t wish the fate of U.S. Steel and Alleghany on Tableau and Zillow, but a century of existence for any company is a remarkable achievement, particularly for one that came about at a high point in investor enthusiasm for its industry.

Photo: Times Square – NASDAQ by luisvilla, on Flickr

Four short links: 8 April 2013

  1. mozpay — a JavaScript API inspired by google.payments.inapp.buy() but modified for things like multiple payment providers and carrier billing. When a web app invokes navigator.mozPay() in Firefox OS, the device shows a secure window with a concise UI. After authenticating, the user can easily charge the payment to her mobile carrier bill or credit card. When completed, the app delivers the product. Repeat purchases are quick and easy.
  2. Firefox Looks Like it Will Reject Third-Party Cookies (ComputerWorld) — kudos Mozilla! Now we’ll see whether such a cookie policy does deliver a better user experience. Can privacy coexist with a good user experience? Answers on a tweet, please, to @radar.
  3. How We Lost the Web (Anil Dash) — excellent talk about the decreasing openness and vanishing shared culture of the web. See also David Weinberger’s transcription.
  4. 3D From Space Shuttle Footage? — neat idea! Filming in 3D generally requires two cameras that are separated laterally, to create the parallax effect needed for stereoscopic vision. Fortunately, videos shot from Earth orbit can be converted to 3D without a second camera, because the camera is constantly in motion.

April 05 2013

Magic

Any sufficiently advanced technology is indistinguishable from magic.
– Arthur C. Clarke

I spent Wednesday at Penn Medicine’s Connected Health event in Philadelphia. We saw an array of technologies that wouldn’t even have been imaginable when I came into this world. Mobile telepresence systems, telesurgery, the ability to remotely detect depression with merely a phone and analysis of its data, real-time remote glucose monitoring, and on and on.

But nothing in technology surprises me anymore. I have Meh’monia, a condition wherein all of the magic and surprise has been drained out of technology, probably by Apple. Today I expect anything that can be imagined to be possible, available, and beautifully executed.

A tiny but powerful computer in my pocket with greater than VGA screen resolution? Meh. Glasses with interactive heads up display? I’ll take the designer version. Hall-roaming robots that bring me my meds and let me make video calls to my family? I saw that on the Jetsons.

On my way home I dropped in at the Penn Museum and spent an hour roaming the collection. Two days later the magic I’m still thinking about is the magic in those galleries. Atoms arranged with human intellect (and vast amounts of human labor) into form with awe-inspiring scale and beauty. Many of the objects on display left me transfixed.


I can believe that almost anything can be designed and manufactured in modern facilities with modern methods, but the idea of a perfect 50-pound crystal sphere emerging from a piece of rock with nothing but years of hand labor seems like magic to modern me. As does a 12-ton sphinx of red granite that was quarried 600 miles from where it was carved.

The technology of our virtual world, which until very recently inspired such a sense of magic in me, has become the every day. And for me at least, those artifacts of a previous physical world now seem like the work of ancient magicians.

Four short links: 5 April 2013

  1. Millimetre-Accuracy 3D Imaging From 1km Away (The Register) — With further development, Heriot-Watt University Research Fellow Aongus McCarthy says, the system could end up both portable and with a range of up to 10 km. See the paper for the full story.
  2. Robot Ants With Pheromones of Light (PLoS Comp Biol) — see also the video. (via IEEE Spectrum’s AI blog)
  3. tabula — open source tool for liberating data tables trapped inside PDF files. (via Source)
  4. There’s No Economic Imperative to Reconsider an Open Internet (SSRN) — The debate on the neutrality of Internet access isn’t new, and if its intensity varies over time, it has for a long while tainted the relationship between Internet Service Providers (ISPs) and Online Service Providers (OSPs). This paper explores the economic relationship between these two types of players, examines in laymen’s terms how the traffic can be routed efficiently and the associated cost of that routing. The paper then assesses various arguments in support of net discrimination to conclude that there is no threat to the internet economy such that reconsidering something as precious as an open internet would be necessary. (via Hamish MacEwan)

April 04 2013

Four short links: 4 April 2013

  1. geo-bootstrap — Twitter Bootstrap fork that looks like a classic geocities page. Because. (via Narciso Jaramillo)
  2. Digital Public Library of America — public libraries sharing full text and metadata for scans, coordinating digitisation, maximum reuse. See The Verge piece. (via Dan Cohen)
  3. Snake Robots — I don’t think this is a joke. The snake robot’s versatile abilities make it a useful tool for reaching locations or viewpoints that humans or other equipment cannot. The robots are able to climb to a high vantage point, maneuver through a variety of terrains, and fit through tight spaces like fences or pipes. These abilities can be useful for scouting and reconnaissance applications in either urban or natural environments. Watch the video, the nightmares will haunt you. (via Aaron Straup Cope)
  4. The Power of Data in Aboriginal Hands (PDF) — critique of government statistical data gathering of Aboriginal populations. That ABS [Australian Bureau of Statistics] survey is designed to assist governments, commentators or academics who want to construct policies that shape our lives or encourage a one-sided public discourse about us and our position in the Australian nation. The survey does not provide information that Indigenous people can use to advance our position because the data is aggregated at the national or state level or within the broad ABS categories of very remote, remote, regional or urban Australia. These categories are constructed in the imagination of the Australian nation state. They are not geographic, social or cultural spaces that have relevance to Aboriginal people. [...] The Australian nation’s foundation document of 1901 explicitly excluded Indigenous people from being counted in the national census. That provision in the constitution, combined with Section 51, sub section 26, which empowered the Commonwealth to make special laws for ‘the people of any race, other than the Aboriginal race in any State’ was an unambiguous and defining statement about Australian nation building. The Founding Fathers mandated the federated governments of Australia to oversee the disappearance of Aboriginal people in Australia.

April 03 2013

Four short links: 3 April 2013

  1. Cap’n Proto — open source, faster protocol buffers (binary data interchange format and RPC system).
  2. Saddle — a high performance data manipulation library for Scala.
  3. Vega — a visualization grammar, a declarative format for creating, saving and sharing visualization designs. (via Flowing Data)
  4. dumpmon — Twitter bot that monitors paste sites for password dumps and other sensitive information. Source on github, see the announcement for more.

April 02 2013

If you’ve ever wondered where those O’Reilly animal covers come from …

The exchange often goes like this:

Stranger: “Where do you work?”

Me: “O’Reilly Media.”

Stranger: “O’Reilly …”

[Long pause while he or she works through the various "O'Reilly" outlets — the TV guy, the auto parts company.]

Me: “You know the books with the animals on the covers?”

Stranger: “Oh yeah!”

And off we go. Those covers are tremendous ice breakers.

The story behind those covers is also notable. Our colleague Edie Freedman, O’Reilly’s creative director and the person who first made the connection between animal engravings and programming languages, has written a short piece about the genesis of the O’Reilly animals. If you’ve ever wondered where those animals came from, her post is worth a read.

(Something I learned from Edie’s post: the covers that get the best response feature 1. animals with recognizable faces and 2. animals that are looking directly at the reader.)

Edie’s “Short history of the O’Reilly Animals” is part of a larger effort to raise awareness for the plight of the O’Reilly animals, many of which are critically endangered. You can learn more about the O’Reilly Animals project here.

Four short links: 2 April 2013

  1. Analyzing mbostock’s queue.js — beautiful walkthrough of a small library, showing the how and why of good coding.
  2. What Job Would You Hire a Textbook To Do? (Karl Fisch) — notes from a Discovery Education “Beyond the Textbook” event. The issues Karl highlights for textbooks (why digital, etc.) are there for all books as we create this new genre.
  3. Neutralizing Open Access (Glyn Moody) — the publishers appear to have captured the UK group implementing the UK’s open access policy. At every single step of the way, the RCUK policy has been weakened. From being the best and most progressive in the world, it’s now considerably weaker than policies already in action elsewhere in the world, and hardly represents an increment on their 2006 policy. What’s at stake? Opportunity to do science faster, to provide source access to research for the public, and to redirect back to research the millions of pounds spent on journal subscriptions.
  4. Turn the Raspberry Pi into a VPN Server (LinuxUser) — One possible scenario for wanting a cheap server that you can leave somewhere is if you have recently moved away from home and would like to be able to easily access all of the devices on the network at home, in a secure manner. This will enable you to send files directly to computers, diagnose problems and other useful things. You’ll also be leaving a powered USB hub connected to the Pi, so that you can tell someone to plug in their flash drive, hard drive etc and put files on it for them. This way, they can simply come and collect it later whenever the transfer has finished.

April 01 2013

Four short links: 1 April 2013

  1. MLDemos — an open-source visualization tool for machine learning algorithms created to help studying and understanding how several algorithms function and how their parameters affect and modify the results in problems of classification, regression, clustering, dimensionality reduction, dynamical systems and reward maximization. (via Mark Alen)
  2. kiln (GitHub) — open source extensible on-device debugging framework for iOS apps.
  3. Industrial Internet — the O’Reilly report on the industrial Internet of things is out. Prasad suggests an illustration: for every car with a rain sensor today, there are more than 10 that don’t have one. Instead of an optical sensor that turns on windshield wipers when it sees water, imagine the human in the car as a sensor — probably somewhat more discerning than the optical sensor in knowing what wiper setting is appropriate. A car could broadcast its wiper setting, along with its location, to the cloud. “Now you’ve got what you might call a rain API — two machines talking, mediated by a human being,” says Prasad. It could alert other cars to the presence of rain, perhaps switching on headlights automatically or changing the assumptions that nearby cars make about road traction. (A toy sketch of such an API follows this list.)
  4. Unique in the Crowd: The Privacy Bounds of Human Mobility (PDF, Nature) — We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier’s antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. We coarsen the data spatially and temporally to find a formula for the uniqueness of human mobility traces given their resolution and the available outside information. This formula shows that the uniqueness of mobility traces decays approximately as the 1/10 power of their resolution. Hence, even coarse datasets provide little anonymity. These findings represent fundamental constraints to an individual’s privacy and have important implications for the design of frameworks and institutions dedicated to protect the privacy of individuals. As Edd observed, “You are a unique snowflake, after all.” (via Alasdair Allan)
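
The “rain API” in item 3 is easy to picture as a tiny service: cars report their wiper setting and location, and nearby cars ask whether anyone close by has wipers running. A toy sketch, where the data model, field names, and thresholds are all invented rather than taken from the report:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Invented data model for the "rain API" idea: cars report wiper settings,
# other cars infer rain from recent nearby reports.
@dataclass
class WiperReport:
    lat: float
    lon: float
    wiper_speed: int            # 0 = off, 1 = intermittent, 2 = full
    reported_at: datetime

reports: list[WiperReport] = []

def report(lat: float, lon: float, wiper_speed: int) -> None:
    reports.append(WiperReport(lat, lon, wiper_speed, datetime.now(timezone.utc)))

def probably_raining(lat: float, lon: float, radius_deg: float = 0.05) -> bool:
    """Crude check: is any recent nearby car running its wipers?"""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=10)
    nearby = [r for r in reports
              if abs(r.lat - lat) < radius_deg and abs(r.lon - lon) < radius_deg
              and r.reported_at > cutoff]
    return any(r.wiper_speed > 0 for r in nearby)

report(45.52, -122.68, 2)                 # a car driving through rain
print(probably_raining(45.53, -122.67))   # a nearby car asks: True
```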

Four short links: 29 March 2013

  1. Titan 0.3 Out — graph database now has full-text, geo, and numeric-range index backends.
  2. Mozilla Security Community Do a Reddit AMA — if you wanted a list of sharp web security people to follow on Twitter, you could do a lot worse than this.
  3. Probabilistic Programming and Bayesian Methods for Hackers (Github) — An introduction to Bayesian methods + probabilistic programming in data analysis with a computation/understanding-first, mathematics-second point of view. All in pure Python. See also Why Probabilistic Programming Matters and Trends to Watch: Logic and Probabilistic Programming. (A tiny computation-first example follows this list.) (via Mike Loukides and Renee DiRestra)
  4. Open Source 3D-Printable Optics Equipment (PLOSone) — This study demonstrates an open-source optical library, which significantly reduces the costs associated with much optical equipment, while also enabling relatively easily adapted customizable designs. The cost reductions in general are over 97%, with some components representing only 1% of the current commercial investment for optical products of similar function. The results of this study make it clear that this method of scientific hardware development enables a much broader audience to participate in optical experimentation both as research and teaching platforms than previous proprietary methods.
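
Item 3’s “computation first, mathematics second” pitch can be illustrated without any special library: estimate an unknown rate by brute-force grid approximation of Bayes’ rule. This is a generic NumPy sketch, not code from the book (which uses a probabilistic-programming library):

```python
import numpy as np

# Computation-first Bayes: posterior over an unknown rate p after observing
# 7 successes in 50 trials, via brute-force grid approximation.
grid = np.linspace(0, 1, 1001)            # candidate values of p
prior = np.ones_like(grid)                # flat prior
successes, trials = 7, 50
likelihood = grid**successes * (1 - grid)**(trials - successes)

posterior = prior * likelihood
posterior /= posterior.sum()

mean = (grid * posterior).sum()
cdf = posterior.cumsum()
lo, hi = grid[cdf.searchsorted(0.025)], grid[cdf.searchsorted(0.975)]
print(f"posterior mean {mean:.3f}, 95% credible interval ({lo:.3f}, {hi:.3f})")
```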

March 28 2013

How crowdfunding and the JOBS Act will shape open source companies

Currently, anyone can crowdfund products, projects, causes, and sometimes debt. Current U.S. Securities and Exchange Commission (SEC) regulations make crowdfunding companies (i.e., selling stocks rather than products on crowdfund platforms) illegal. The only way to sell stocks to the public at large under the current law is through the heavily regulated Initial Public Offering (IPO) process.

The JOBS Act will soon change these rules. This will mean that platforms like Kickstarter will be able to sell shares in companies, assuming those companies follow certain strict rules. This change in finance law will enable open source companies to access capital and dominate the technology industry. This is the dawn of crowdfunded finance, and with it comes the dawn of open source technology everywhere.

The JOBS Act is already law, and it required the SEC to create specific rules by specific deadlines. The SEC is working on the rulemaking, but it has made it clear that, given the complexity of this new finance structure, meeting the deadlines is not achievable. No one is happy with the delay, but the rules should be done in late 2013 or early 2014.

When those rules are addressed, thousands of open source companies will use this financial instrument to create new types of enterprise open source software, hardware, and bioware. These companies will be comfortably funded by their open source communities. Unlike traditional venture-capital-backed companies, these new companies will narrowly focus on getting the technology right and putting their communities first. Eventually, I think these companies will make most proprietary software companies obsolete.

How are companies like Oracle, Apple, Microsoft, SAS and Cisco able to make so much money in markets that have capable commercial open source competitors? In a word: capital. These companies have access to guaranteed cash flows from locked-in users of their products.

Therefore, venture capital investors are willing to provide startup capital to new businesses only when they demonstrate the capacity for new lock-in. Investors that start technology companies avoid investments that do not trap their user bases. That means entrenched proprietary players frequently face no serious threats from open source alternatives. The result? Lots of lock-in and lots of customers trapped in long-term relationships with proprietary companies that have little motivation to treat them fairly.

The only real argument against business models that respect software freedom has always been about access to capital. Startups are afraid to release using a FOSS license because that decision limits their access to investment. Early-stage investors love to hear the words “proprietary,” “patent-pending” and “trade secret,” which they mentally translate into “exit strategy.” For these investors, trapping users is a hedge against their inability to evaluate early-stage technology startups. (I am sympathetic; predicting the future is hard.) As a result, most successfully funded technology startups are either proprietary, patented platforms or software-as-a-service (SaaS) platforms.

This is all going to change.

Crowdfunded finance is going to shift the funding of software forever, and it is going to create a new class of tech organization: freedom-first technology companies.

Now, open source projects will be able to seek and find crowds of investors from within their own communities. These companies will have both the traditional advantages of proprietary companies (well-capitalized companies recruit armies of competent programmers and sales forces that can survive long sales cycles) and the advantages of the open source development model (open code review and the ability to integrate the insights of outsiders).

Yesterday, it was a big deal if you could get Intel to invest in your company. Tomorrow, you will seek funding from 500 Intel employees, who are all better qualified to vet your technology startup than 90% of the people in Intel’s investment arm. These crowdfunders are also willing to make a decision to invest in six hours rather than six months.

For this reason, I believe there will be a treasure trove of companies that will soon be born out of open source/libre software/hardware/bioware projects by asking their communities to crowdfund their initial rounds of financing. Large community projects will give birth to one or several different companies that are designed from the ground up to support those projects. GitHub and Thingiverse will become the new hubs for investors. Developers who have demonstrated competence in projects will be rewarded with access to financing that is both cheaper and faster than seed or angel funding.

With this fundamental change in incentive structures, open source projects will have the capital they need to try truly radical approaches to the design of their projects. Currently, open source projects have to choose between begging for capital or living without that capital. Many open source projects choose slow and gradual development not because they prefer it, but because this is what the developers involved can afford to do in their spare time. The Debian and Ubuntu projects are illustrative of the differences in style and result when the same community is “shoestringing it” versus having access to capital. The people running many open source projects know that no angel investor would touch them, so they make slow and steady progress to “good” software releases rather than rapid progress to “amazing” software releases.

These new freedom-first companies will be able to prioritize what is best for their projects and their communities without bearing the wrath of their investors. That’s because their communities are their investors.

This is not going to merely create a class of software that can rival current proprietary software vendors. In a sense, current commercial open source companies are already doing a fine job of that. But those open source companies typically have investors who are similarly desperate for hockey-stick returns. Even they must choose software development strategies that will pay off for investors. This new class of company will prefer technical strategies that will pay off for end users.

That might seem like a small distinction, but this incentive tweak will change everything about how software is made.

The new companies that leverage this funding option will look a lot like Canonical. Canonical is the kind of company you get when the geeks are fully in charge, and you have investors who are very tolerant of long-term risk because they grok the underlying technical problems that sometimes take decades to entirely resolve. Also, the investors probably know what the word “grok” means. But, unfortunately, there are only so many Mark Shuttleworth-types around (one as far as I know, but a guy who can get himself into space can probably be first in line for human cloning, too).

Shuttleworth is famous for reading printouts of the Debian mailing list on vacation as he figured out which Debian developers to hire for Canonical, the new Linux startup he was funding. That kind of behavior is not what most financial analysts do before making an investment, but this, and other similar efforts, allowed Shuttleworth to predict and control the future of a very technical financial opportunity. It is this kind of focus that allowed Shuttleworth to make one great investment and know that it would work, rather than making hundreds of investments hoping that one of them would work. Using the JOBS Act, community members that already sustain that level of research about an open source project can make the same kinds of bets, but with much less money. (It is ironic that so many of the critics of the JOBS Act presume that the crowd is ignorant rather than recognizing the potential for hyper-informed investor communities.)

In addition, companies like Canonical, Rackspace, Google, Amazon, and Red Hat might acquire these new companies. All of these established organizations can afford billion-dollar acquisitions, and they are either entirely open source or they are very open-source friendly. This kind of acquisition potential will ensure that once open source technology companies prove themselves, they will have access to series A and B financing. I also expect there will be several new open source mega-companies that emerge that are even more devoted to community and end-user experience than these current open source leaders.

This new class of company will have lots and lots of hockey sticks and plenty of billion-dollar exits. Companies will achieve these exits precisely because they do not focus on them (it’s very Zen). They will choose and execute visionary technical strategies that no outside investor could understand. These strategies will seem obvious to their communities of users/investors. These companies will be able to move into capitalization as soon as their communities are convinced the technical strategies and execution capabilities are sound. All of this will lead to better, faster, bigger open source stuff.


If you’d like to talk with other people who are interested in getting and giving funding for open source companies, I set up a related mailing list and Twitter account (@WeInvestInUs).


Four short links: 28 March 2013

  1. What American Startups Can Learn From the Cutthroat Chinese Software Industry — It follows that the idea of “viral” or “organic” growth doesn’t exist in China. “User acquisition is all about media buys. Platform-to-platform in China is war, and it is fought viciously and bitterly. If you have a Gmail account and send an email to, for example, NetEase163.com, which is the local web dominant player, it will most likely go to spam or junk folders regardless of your settings. Just to get an email to go through to your inbox, the company sending the email needs to have a special partnership.” This entire article is a horror show.
  2. White House Hangout Maker Movement (Whitehouse) — During the Hangout, Tom Kalil will discuss the elements of an “all hands on deck” effort to promote Making, with participants including: Dale Dougherty, Founder and Publisher of MAKE; Tara Tiger Brown, Los Angeles Makerspace; Super Awesome Sylvia, Super Awesome Maker Show; Saul Griffith, Co-Founder, Otherlab; Venkatesh Prasad, Ford.
  3. Municipal Codes of DC Freed (BoingBoing) — more good work by Carl Malamud. He’s specifically providing data for apps.
  4. The Modern Malware Review (PDF) — 90% of fully undetected malware was delivered via web-browsing; it took antivirus vendors 4 times as long to detect malware from web-based applications as opposed to email (20 days for web, 5 days for email); FTP was observed to be exceptionally high-risk.