
February 10 2014

Four short links: 10 February 2014

  1. Bruce Sterling at transmediale 2014 (YouTube) — “if it works, it’s already obsolete.” Sterling does a great job of capturing the current time: spies in your Internet, lost trust with the BigCos, the impermanence of status quo, the need to create. (via BoingBoing)
  2. No-one Should Fork Android (Ars Technica) — this article is bang on. Google Mobile Services (the Play functionality) is closed-source, what makes Android more than a bare-metal OS, and is where G is focusing its development. Google’s Android team treats openness like a bug and routes around it.
  3. Data Pipelines (Hakkalabs) — interesting overview of the data pipelines of Stripe, Tapad, Etsy, and Square.
  4. Visualising Salesforce Data in Minecraft — would almost make me look forward to using Salesforce. Almost.

February 05 2014

Four short links: 6 February 2014

  1. What Machines Can’t Do (NY Times) — In the 1950s, the bureaucracy was the computer. People were organized into technocratic systems in order to perform routinized information processing. But now the computer is the computer. The role of the human is not to be dispassionate, depersonalized or neutral. It is precisely the emotive traits that are rewarded: the voracious lust for understanding, the enthusiasm for work, the ability to grasp the gist, the empathetic sensitivity to what will attract attention and linger in the mind. Cf the fantastic The Most Human Human. (via Jim Stogdill)
  2. The Technium: A Conversation with Kevin Kelly (Edge) — If we were sent back with a time machine, even 20 years, and reported to people what we have right now and describe what we were going to get in this device in our pocket—we’d have this free encyclopedia, and we’d have street maps to most of the cities of the world, and we’d have box scores in real time and stock quotes and weather reports, PDFs for every manual in the world—we’d make this very, very, very long list of things that we would say we would have and we get on this device in our pocket, and then we would tell them that most of this content was free. You would simply be declared insane. They would say there is no economic model to make this. What is the economics of this? It doesn’t make any sense, and it seems far-fetched and nearly impossible. But the next twenty years are going to make this last twenty years just pale. (via Sara Winge)
  3. Applying Machine Learning to Network Security Monitoring (Slideshare) — interesting deck on big data + machine learning as applied to netsec. See also their ML Sec Project. (via Anton Chuvakin)
  4. Medieval Unicode Font Initiative — code points for medieval markup. I would have put money on Ogonek being a fantasy warrior race. Go figure.

December 31 2013

Four short links: 31 December 2013

  1. Toyota Manufacturing Principles (Joseph Cohen) — Jidoka: Automation with a Human Touch. The idea of jidoka is that humans should work with machines to produce the best possible outcome, leveraging the execution ability of a machine and the judgement of a human. We at O’R Radar have been saying for years that there’s gold in the collaboration between people and machines, about augmenting people and not simply replacing them.
  2. Twister — the fully decentralized P2P microblogging platform leveraging from the free software implementations of Bitcoin and BitTorrent protocols. Interesting to see BT and BC reused as platforms for app development, though if eventual consistency and threading Heisenbugs gave you headaches then just wait for the world of Bitcoin-meets-BitTorrent….
  3. Free Uncopyrighted NDA and Employment Contracts — CC0′d legalware.
  4. Transcript of Glenn Greenwald’s Speech to CCC — the relationship of privacy to security, and the transparency of governmental positions on that relationship, remain unaddressed. NSA’s actions are being used to establish local governmental control of the Internet, which will destroy the multistakeholder model that has kept net architecture and policy largely separate from the whims of elected officials. The fallout of Snowden’s revelations will shape 2014. Happy New Year.

December 20 2013

Four short links: 20 December 2013

  1. A History of the Future in 100 Objects — is out! It’s design fiction, describing the future of technology in faux Wired-like product writeups. Amazon already beating the timeline.
  2. Projects and Priorities Without Managers (Ryan Carson) — love what he’s doing with Treehouse. Very Googley. The more I read about these low-touch systems, the more obviously important self-reporting is. It is vital that everyone posts daily updates on what they’re working on or this whole idea will fall down.
  3. Intellectual Ventures Patent Collection — astonishing collection, ready to be sliced and diced in Cambia’s Lens tool. See the accompanying blog post for charts, graphs, and explanation of where the data came from.
  4. Smokio Electronic Cigarette — the quantified cigarette (not yet announced) for measuring your (electronic) cigarette consumption and uploading the data (natch) to your smartphone. Soon your cigarette will have an IPv6 address, a bluetooth connection, and firmware to be pwned.

December 18 2013

Four short links: 18 December 2013

  1. Cyberpunk 2013 — a roleplaying game shows a Gibsonian view of 2013 from 1988. (via Ben Hammersley)
  2. The Future Computer Utility — 1967 prediction of the current state. There are several reasons why some form of regulation may be required. Consider one of the more dramatic ones, that of privacy and freedom from tampering. Highly sensitive personal and important business information will be stored in many of the contemplated systems. Information will be exchanged over easy-to-tap telephone lines. At best, nothing more than trust—or, at best, a lack of technical sophistication—stands in the way of a would-be eavesdropper. All data flow over the lines of the commercial telephone system. Hanky-panky by an imaginative computer designer, operator, technician, communications worker, or programmer could have disastrous consequences. As time-shared computers come into wider use, and hold more sensitive information, these problems can only increase. Today we lack the mechanisms to insure adequate safeguards. Because of the difficulty in rebuilding complex systems to incorporate safeguards at a later date, it appears desirable to anticipate these problems. (via New Yorker)
  3. Lantronix XPort Pro Lx6 — a secure embedded device server supporting IPv6 that is barely larger than an RJ45 connector. The device runs Linux or the company’s Evolution OS, and is destined to be used in wired industrial IoT / M2M applications.
  4. Pond — interesting post-NSA experiment in forward secure, asynchronous messaging for the discerning. Pond messages are asynchronous, but are not a record; they expire automatically a week after they are received. Pond seeks to prevent leaking traffic information against everyone except a global passive attacker. (via Morgan Mayhem)

November 20 2013

Four short links: 20 November 2013

  1. Innovation and the Coming Shape of Social Transformation (Techonomy) — great interview with Tim O’Reilly and Max Levchin. in electronics and in our devices, we’re getting more and more a sense of how to fix things, where they break. And yet as a culture, what we have chosen to do is to make those devices more disposable, not last forever. And why do you think it will be different with people? To me one of the real risks is, yes, we get this technology of life extension, and it’s reserved for a very few, very rich people, and everybody else becomes more disposable.
  2. Attending a Conference via a Telepresence Robot (IEEE) — interesting idea, and I look forward to giving it a try. The mark of success for the idea, alas, is two bots facing each other having a conversation.
  3. Drone Imagery for OpenStreetMap — 100 acres of 4cm/pixel imagery, in less than an hour.
  4. LG Smart TV Phones Home with Shows and Played Files — welcome to the Internet of Manufacturer Malware.

October 18 2013

Four short links: 18 October 2013

  1. Science Not as Self-Correcting As It Thinks (Economist) — REALLY good discussion of the shortcomings in statistical practice by scientists, peer-review failures, and the complexities of experimental procedure and fuzziness of what reproducibility might actually mean.
  2. Reproducibility Initiative Receives Grant to Validate Landmark Cancer Studies — The key experimental findings from each cancer study will be replicated by experts from the Science Exchange network according to best practices for replication established by the Center for Open Science through the Center’s Open Science Framework, and the impact of the replications will be tracked on Mendeley’s research analytics platform. All of the ultimate publications and data will be freely available online, providing the first publicly available complete dataset of replicated biomedical research and representing a major advancement in the study of reproducibility of research.
  3. $20 SDR Police Scanner — using software-defined radio to listen to the police band.
  4. Reimagine the Chemistry Set — $50k prize in contest to design a “chemistry set” type kit that will engage kids as young as 8 and inspire people who are 88. We’re looking for ideas that encourage kids to explore, create, build and question. We’re looking for ideas that honor kids’ curiosity about how things work. Backed by the Moore Foundation and Society for Science and the Public.

September 05 2013

Four short links: 6 September 2013

  1. In Search of the Optimal Cheeseburger (Hilary Mason) — playing with NYC menu data. There are 5,247 cheeseburgers you can order in Manhattan. Her Ignite talk from Ignite NYC15.
  2. James Burke Predicting the Future — spoiler: massive disruption from nano-scale personal fabbing.
  3. Stanford Javascript Crypto Library — a project by the Stanford Computer Security Lab to build a secure, powerful, fast, small, easy-to-use, cross-browser library for cryptography in Javascript.
  4. The STEM Crisis is a Myth (IEEE Spectrum) — Every year U.S. schools grant more STEM degrees than there are available jobs. When you factor in H-1B visa holders, existing STEM degree holders, and the like, it’s hard to make a case that there’s a STEM labor shortage.

August 07 2013

Predicting the future: Strata 2014 hot topics

Conferences like Strata are planned a year in advance. The logistics and coordination required for an event of this magnitude take a lot of planning, but they also take a decent amount of prediction: Strata needs to skate to where the puck is going.

While Strata New York + Hadoop World 2013 is still a few months away, we’re already guessing at what next year’s Santa Clara event will hold. Recently, the team got together to identify some of the hot topics in big data, ubiquitous computing, and new interfaces. We selected eleven big topics for deeper investigation.

  • Deep learning
  • Time-series data
  • The big data “app stack”
  • Cultural barriers to change
  • Design patterns
  • Laggards and Luddites
  • The convergence of two databases
  • The other stacks
  • Mobile data
  • The analytic life-cycle
  • Data anthropology

Here’s a bit more detail on each of them.

Deep learning

Teaching machines to think has been a dream/nightmare of scientists for a long time. Rather than teaching a machine explicitly, or using Watson-like statistics to figure out the best answer from a mass of data, Deep Learning uses simpler, core ideas and then builds upon them — much as a baby learns sounds, then words, then sentences.

It’s been applied to problems like vision (find an edge, then a shape, then an object) and better voice recognition. But advances in processing and algorithms are making it increasingly attractive for a large number of challenges. A Deep Learning model “copes” better with things its creators can’t foresee, or genuinely new situations. A recent MIT Technology Review article said these approaches improved image recognition by 70%, and improved Android voice recognition 25%. But 80% of the benefits come from additional computing power, not algorithms, so this is stuff that’s only become possible with the advent of cheap, on-demand, highly parallel processing.

The main drivers of this approach are big companies like Google (which acquired DNNResearch), IBM and Microsoft. There are also startups in the machine learning space like Vicarious and Grok (née Numenta).

Deep Learning isn’t without its critics. Something learned in a moment of pain or danger might not be true later on, so the system needs to unlearn a conclusion, or at least reduce its certainty. What’s more, certain things might only be true after a sequence of events: once we’ve seen a person put a ball in a box and close the lid, we know there is a ball in the box, but a picture of the box afterward wouldn’t reveal this. Inability to take into account time is one of the criticisms Grok founder Jeff Hawkins levels at Deep Learning.

There’s some good debate, and real progress in AI and machine learning, as a result of the new computing systems that make these models possible. They’ll likely supplant the expert systems (yes/no trees) that are used in many industries, but have fundamental flaws. Ben Goldacre described this problem at Strata in 2012: almost every patient who displays the symptoms of a rare disease instead has two, much more common, diseases with those symptoms.

Also, this is why House is a terrible doctor’s show.

In 2014, much of the data science content of Strata will focus on making machines smarter, and much of this will come from abundant back-end processing paired with advances in vision, sensemaking, and context.

Time-series data

Data is often structured according to the way it will be used.

  • To data designers, a graph is a mathematical structure that describes how a pair of objects relate to one another. This is why Facebook’s search tool is called Graph Search. To work with large numbers of relationships, we use a Graph database that organizes everything in it according to how it’s related to everything else. This makes it easy to find things that are linked to one another, like routers in a network or friends at a company, even with millions of connections. As a result, it’s often in the core of a social network’s application stack. Companies like Neo4j and Titan and Vertex make them.
  • On the other hand, a relational database keeps several tables of data (your name; a product purchase) and then links them by a common thread (such as the credit card used to buy the product, or the name of the person to whom it belongs). When most traditional enterprise IT people say “database,” they mean a relational database (RDBMS). The RDBMS has been so successful it’s supplanted most other forms of data storage.
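To make the graph idea concrete, here is a minimal sketch of the kind of question a graph store is organized to answer: who is within a couple of hops of a given person? The friend names and the `within_hops` helper are made up for illustration, and a plain Python dictionary stands in for a real graph database.

```python
from collections import deque

# Hypothetical friendship graph: each person maps to the people they are linked to.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann"},
    "dan": {"bob", "eve"},
    "eve": {"dan"},
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: return {person: distance} for everyone within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]  # the starting person is not their own friend
    return distance

friends_of_friends = within_hops(friends, "ann", 2)
print(sorted(friends_of_friends.items()))
```

Because the data is stored by relationship, the search only touches the records it walks through, which is the property that makes linked lookups cheap even when the graph is large.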

(As a sidenote, at the core of the RDBMS is a “join,” an operation that links two tables. Much of the excitement around NoSQL databases was in fact about doing away with the join, which — though powerful — significantly restricts how quickly and efficiently an RDBMS can process large amounts of data. Ironically, the dominant language for querying many of these NoSQL databases through tools like Impala is now SQL. If the NoSQL movement had instead been called NoJoin, things might have been much more clear.)
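For readers who haven't written one, a join looks roughly like this. It is only a sketch using Python's built-in sqlite3 module; the `customers` and `purchases` tables, the card numbers, and the names are all invented for the example.

```python
import sqlite3

# Hypothetical schema: two tables linked by a common thread, the credit card number.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (card TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (card TEXT, product TEXT);
    INSERT INTO customers VALUES ('4111', 'Alice'), ('4222', 'Bob');
    INSERT INTO purchases VALUES ('4111', 'kettle'), ('4111', 'toaster'), ('4222', 'lamp');
""")

# The join pairs each purchase with the customer whose card number matches.
rows = conn.execute("""
    SELECT c.name, p.product
    FROM customers AS c
    JOIN purchases AS p ON p.card = c.card
    ORDER BY c.name, p.product
""").fetchall()

for name, product in rows:
    print(name, product)
```

The database has to match rows across both tables to answer the query, and that matching work is what becomes expensive as the tables grow.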

Book Spiral – Seattle Central Library by brewbooks, on Flickr

Data systems are often optimized for a specific use.

  • Think of a coin-sorting machine — it’s really good at organizing many coins of a limited variety (nickels, dimes, pennies, etc.).
  • Now think of a library: it’s really good at handling a huge diversity of books, often only one or two of each, but it’s not very fast.

Databases are the same: a graph database is built differently from a relational database; an analytical database (used to explore and report on data) is different from an operational one (used in production).

Most of the data in your life — from your Facebook feed to your bank statement — has one common element: time. Time is the primary key of the universe.

Since time is often the common thread in data, optimizing databases and processing systems to be really, really good at handling data over time is a huge benefit for many applications, particularly those that try to find correlations between seemingly different data — does the temperature on your NEST thermostat correlate with an increase in asthma inhaler use? Black Swans aside, time is also useful when trying to predict the future from the past.
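As a rough sketch of what "time as the common thread" means in practice, the snippet below lines up two unrelated series on their shared timestamps and then measures how strongly they move together. The readings are invented, and `align_on_time` and `pearson` are hypothetical helper names; a time-series database would do the alignment step for you at far larger scale.

```python
from math import sqrt

# Made-up daily readings keyed by date; neither dataset is real.
thermostat_temp_c = {"2013-08-01": 19.0, "2013-08-02": 21.0,
                     "2013-08-03": 23.0, "2013-08-04": 25.0}
inhaler_uses = {"2013-08-02": 2, "2013-08-03": 3,
                "2013-08-04": 5, "2013-08-05": 4}

def align_on_time(a, b):
    """Pair up values for the timestamps both series share, in time order."""
    shared = sorted(a.keys() & b.keys())
    return [a[t] for t in shared], [b[t] for t in shared]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

temps, uses = align_on_time(thermostat_temp_c, inhaler_uses)
r = pearson(temps, uses)
print(f"{len(temps)} shared days, correlation {r:.2f}")
```

Only the three days present in both series are compared. A high coefficient on a sample this small proves nothing, of course, and it says nothing about which way any cause might run.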

Time Series data is at the root of life-logging and the Quantified Self movement, and will be critical for the Internet of Things. It’s a natural way to organize things which, as humans, we fundamentally understand. Time series databases have a long history, and there’s a lot of effort underway to modernize them as well as the analytical tools that crunch the data they contain, so we think time-series data deserves deeper study in 2014.

The Big Data app stack

We think we’re about to see the rise of application suites for big data. Consider the following evolution:

  1. On a mainframe, the hardware, operating system, and applications were often indistinguishable.
  2. Much of the growth of consumer PCs happened because of the separation of these pieces — companies like Intel and Phoenix made the hardware; Microsoft and Red Hat made the OS; and developers like WordPerfect, Lotus, and DBase made the applications.
  3. Eventually, we figured out what the PC was “for” and it acquired a core set of applications without which, it seems, a PC wouldn’t be useful. Those are generally described as “office suites,” and while there was once a rivalry for them, today, they’ve been subsumed by OS makers (Apple, Microsoft, Open Source) while those that didn’t have an OS withered on the vine (Corel).
  4. As we moved onto the web, the same thing started to happen — email, social network, blog, and calendar seemed to be useful online applications now that we were all connected, and the big portal makers like Google, Sina, Yahoo, Naver, and Facebook made “suites” of these things. So, too, did the smartphone platforms, from PalmPilot to Blackberry to Apple and Android.
  5. Today’s private cloud platforms are like yesterday’s operating systems, with OpenStack, CloudPlatform, VMWare, Eucalyptus, and a few others competing based on their compatibility with public clouds, hardware, and applications. Clouds are just going through this transition to apps, and we’re learning that their “app suite” includes things like virtual desktops, disaster recovery, on-demand storage — and of course, big data.

Okay, enough history lesson.

We’re seeing similar patterns emerge in big data. But it’s harder to see what the application suite is before it happens. In 2014, we think we’ll be asking ourselves, what’s the Microsoft Office of Big Data? We can make some guesses:

  • Predicting the future
  • Deciding what people or things are related to other people or things
  • Helping to power augmented reality tools like Google Glass with smart context
  • Making recommendations by guessing what products will appeal to which customers
  • Optimizing bottlenecks in supply chains or processes
  • Identifying health risks or anomalies worthy of investigation

Companies like Wibidata are trying to figure this out — and getting backed by investors with deep pockets. Just as most of the interesting stories about operating systems were the apps that ran on them, and the stories about clouds are things like big data, so the good stories about big data are the “office suites” atop it. Put another way, we don’t know yet what big data is for, but I suspect that in 2014 we’ll start to find out.

Cultural barriers to data-driven change

Every time I talk with companies about data, they love the concept but fail on the execution. There are a number of reasons for this:

  • Incumbency. Yesterday’s leaders were those who could convince others to act in the absence of information. Tomorrow’s leaders are those who can ask the right questions. This means there is a lot of resistance from yesterday’s leaders (think Moneyball).
  • Lack of empowerment. I recently ate a meal in the Pittsburgh airport, and my bill came with a purple pen. I’m now wondering if I tipped differently because of that. What ink colour maximizes per-cover revenues in an airport restaurant? (Admittedly, I’m a bit obsessive.) But there’s no reason someone couldn’t run that experiment, and increase revenues. Are they empowered to do so? How would they capture the data? What would they deem a success? These are cultural and organizational questions that need to be tackled by the company if it is to become data-driven.
  • Risk aversion. Steve Blank says a startup is an organization designed to search for a scalable, repeatable business model. Here’s a corollary: a big company is one designed to perpetuate a scalable, repeatable business model. Change is not in its DNA — predictability is. Since the days of Daniel McCallum, organizational charts and processes fundamentally reinforce the current way of doing things. It often takes a crisis (such as German jet planes in World War Two or Netscape’s attack on Microsoft) to evoke a response (the Lockheed Martin Skunk Works or a free web browser).
  • Improper understanding. Correlation is not causality — there is a correlation between ice cream and drowning, but that doesn’t mean we should ban ice cream. Both are caused by summertime. We should hire more lifeguards (and stock up on ice cream!) in the summer. Yet many people don’t distinguish between correlation and causality. As a species, humans are wired to find patterns everywhere because a false positive (turning when we hear a rustle in the bushes, only to find there’s nothing there) is less dangerous than a false negative (not turning and getting eaten by a sabre-toothed tiger).
  • Focus on the wrong data. Lean Analytics urges founders to be more data-driven and less self-delusional. But when I recently spoke with executives from DHL’s innovation group, they said that innovation in a big company requires a wilful disregard for data. That’s because the preponderance of data in a big company reinforces the status quo; nascent, disruptive ideas don’t stand a chance. Big organizations have all the evidence they need to keep doing what they have always done.

There are plenty of other reasons why big organizations have a hard time embracing data. Companies like IBM, CGI, and Accenture are minting money trying to help incumbent organizations be the next Netflix and not the next Blockbuster.

What’s more, the advent of clouds, social media, and tools like PayPal or the App Store has destroyed many of the barriers to entry on which big companies rely. As Quentin Hardy pointed out in a recent article, fewer and fewer big firms stick around for the long haul.

Design patterns

As any conference matures, we move into best practices. In system design, these take the form of proven architectures: recipes people can re-use. Just as a baker knows how to make an icing sauce with fat and sugar, and can adjust it to make myriad variations, so, too, can an architect use a particular architecture to build a known, working component or service.

As Mike Loukides points out, a design pattern is even more abstract than a recipe. It’s like saying, “sweet bread with topping,” which can then be instantiated in any number of different kinds of cake recipes. So, we have a design pattern for “highly available storage” and then rely on proven architectural recipes such as load-balancing, geographic redundancy, and eventual consistency to achieve it.

Such recipes are well understood in computing, and they eventually become standards and appliances. We have a “scale-out” architecture for web computing, where many cheap computers can handle a task, as an Application Delivery Controller (a load balancer) “sprays” traffic across those machines. It’s common wisdom today. But once, it was innovative. Same thing with password recovery mechanisms and hundreds of other building blocks.

We’ll see these building blocks emerge for data systems that meet specific needs. For example, a new technology called homomorphic encryption allows us to analyze data while it is still encrypted, without actually seeing the data. That would, for example, allow us to measure the spread of a disease without violating the privacy of the individual patients. (We had a presenter talk about this at DDBD in Santa Clara.) This will eventually become a vital ingredient in a recipe for “data where privacy is maintained.” There will be other recipes optimized for speed, or resiliency, or cost, all in service of the “highly available storage” pattern.
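To give a feel for computing on data you can't read, here is a toy sketch using the Paillier scheme, which is only additively homomorphic (it can sum encrypted numbers, nothing more). This is my choice of illustration, not the system the presenter described. The primes are tiny and the clinic counts are made up, so treat it as a classroom demonstration rather than usable cryptography.

```python
import math
import random

# Toy key generation with tiny primes (an assumption for readability; real keys are far larger).
p, q = 1009, 1013
n = p * q
n_sq = n * n
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
mu = pow(lam, -1, n)  # modular inverse, Python 3.8+

def encrypt(m):
    """Encrypt an integer 0 <= m < n under the public key (n, g)."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    """Decrypt with the private values (lam, mu)."""
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n

# Hypothetical case counts reported by three clinics, each encrypted before sharing.
clinic_counts = [12, 7, 30]
ciphertexts = [encrypt(count) for count in clinic_counts]

# The aggregator multiplies ciphertexts, which adds the hidden plaintexts.
encrypted_total = 1
for c in ciphertexts:
    encrypted_total = (encrypted_total * c) % n_sq

total = decrypt(encrypted_total)
print(total)
```

The aggregator never sees an individual clinic's number; only the holder of the private key learns the sum, which is the privacy property the disease-tracking example relies on.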

This is how we move beyond vendors. Just as a scale-out web infrastructure can have an ADC from Radware, Citrix, F5, Riverbed, Cisco, and others (with the same pattern), we’ll see design patterns for big data with components that could come from Cloudera, Hortonworks, IBM, Intel, MapR, Oracle, Microsoft, Google, Amazon, Rackspace, Teradata, and hundreds of others.

Note that many vendors who want to sell “software suites” will hate this. Just as stereo vendors tried to sell all-in-one audio systems, which ultimately weren’t very good, many of today’s commercial providers want to sell turnkey systems that don’t allow the replacement of components. Design patterns and the architectures on which they rely are anathema to these closed systems — and are often where standards tracks emerge. 2014 is when that debate will start out in Big Data.

Laggards and Luddites

Certain industries are inherently risk-averse, or not technological. But that changes fast. A few years ago, I was helping a company called FarmsReach connect restaurants to local farmers and turn the public market into a supply chain hub. We spent a ton of effort building a fax gateway because farmers didn’t have mobile phones, and ultimately, the company pivoted to focus on building networks between farmers.

Today, however, farmers are adopting tech quickly, and they rely on things like GPS-based tractor routing and seed sowing (known as “Satellite Farming”) to get the most from their fields.

As the cost of big data drops and the ease of use increases, we’ll see it applied in many other places. Consider, for example, a city that can’t handle waste disposal. Traditionally, the city would buy more garbage trucks and hire more garbage collectors. But now, it can analyze routing and find places to optimize collection. Unfortunately, this requires increased tracking of workers — something the unions will resist very vocally. We already saw this in education, where efforts to track students were shut down by teachers’ unions.

In 2014, big data will be crossing the chasm, welcoming late adopters and critics to the conversation. It’ll mean broadening the scope of the discussion — and addressing newfound skepticism — at Strata.

Convergence of two databases

If you’re running a data-driven product today, you typically have two parallel systems.

  • One’s in production. If you’re an online retailer, this is where the shopping cart and its contents live, or where the user’s shipping address is stored.
  • The other’s used for analysis. An online retailer might make queries to find out what someone bought in order to handle a customer complaint or generate a report to see which products are selling best.

Analytical technology comes from companies like Teradata, IBM (from the Cognos acquisition), Oracle (from the Hyperion acquisition), SAP, and independent Microstrategy, among many others. They use words like “Data Warehouse” to describe these products, and they’ve been making them for decades. Data analysts work with them, running queries and sending reports to corporate bosses. A standalone analytical data warehouse is commonly accepted wisdom in enterprise IT.

But those data warehouses are getting faster and faster. Rather than running a report and getting it a day later, analysts can explore the data in real time — re-sorting it by some dimension, filtering it in some way, and drilling down. This is often called pivoting, and if you’ve ever used a Pivot Table in Excel you know what it’s like. In data warehouses, however, we’re dealing with millions of rows.
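If you haven't used a pivot table, the operation is roughly the following. This is a sketch with a handful of invented sales rows and a hypothetical `pivot` helper; a warehouse does the same regrouping over millions of rows.

```python
from collections import defaultdict

# Made-up sales records: (region, product, units sold).
sales = [
    ("east", "widget", 10),
    ("east", "gadget", 4),
    ("west", "widget", 7),
    ("east", "widget", 3),
    ("west", "gadget", 6),
]

def pivot(rows, row_field, col_field, value_field):
    """Sum value_field into a table keyed by row_field, then col_field (tuple positions)."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_field]][row[col_field]] += row[value_field]
    return {r: dict(cols) for r, cols in table.items()}

by_region = pivot(sales, 0, 1, 2)   # regions down the side, products across
by_product = pivot(sales, 1, 0, 2)  # the same data, re-sorted by the other dimension

print(by_region)
print(by_product)
```

Swapping the two field arguments is the "pivot": the numbers are unchanged, but the analyst sees them grouped along a different dimension.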

At the same time, operational databases are getting faster and sneakier. Traditionally, a database is the bottleneck in an application because it doesn’t handle concurrency well. If a record is being changed in the database by one person, it’s locked so nobody else can touch it. If I am editing a Word document, it makes sense to lock it so someone else doesn’t edit it — after all, what would we do with the changes we’d both made?

But that model wouldn’t work for Facebook or Twitter. Imagine a world where, when you’re updating your status, all your friends can’t refresh their feeds.

We’ve found ways to fix this. When several people edit a Google Doc at once, for instance, each of their changes is made as a series of small transactions. The document doesn’t really exist — instead, it’s a series of transactional updates, assembled to look like a document. Similarly, when you post something to Facebook, those changes eventually find their way to your friends. The same is true on Twitter or Google+.
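Here is a minimal sketch of a document that is nothing but a log of small updates. It is not how Google Docs is implemented (real collaborative editors also have to reconcile edits that arrive out of order, which this ignores); the operation format and the `apply_op` and `render` names are assumptions made up for the example.

```python
def apply_op(text, op):
    """Apply one small edit: ("insert", position, string) or ("delete", position, length)."""
    kind, pos, payload = op
    if kind == "insert":
        return text[:pos] + payload + text[pos:]
    if kind == "delete":
        return text[:pos] + text[pos + payload:]
    raise ValueError(f"unknown operation: {kind}")

def render(log):
    """Replay every logged edit, in order, to assemble the current document."""
    text = ""
    for op in log:
        text = apply_op(text, op)
    return text

# Hypothetical edit log; the "document" is only ever stored in this form.
log = [
    ("insert", 0, "Hello world"),
    ("insert", 5, ","),
    ("delete", 7, 5),        # removes "world"
    ("insert", 7, "Strata"),
]

document = render(log)
print(document)
```

Since each edit is a tiny transaction appended to the log, no one needs to lock the whole document while they type.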

These kinds of eventually consistent approaches make concurrent editing possible. They aren’t really new, either: your bank statement is eventually consistent, and when you check it online, the bottom of the statement tells you that the balance is only valid up until a period in the past and new transactions may take a while to post. Here’s what mine says:

Transactions from today are reflected in your balance, but may not be displayed on this page if you recently updated your bankbook, if a paper statement was issued, or if a transaction is backdated. These transactions will appear in your history the following business day.

Clearly, if eventual consistency is good enough for my bank account, it’s good enough for some forms of enterprise data.
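One simple way to see eventual consistency at work is a grow-only counter, a textbook structure I'm using here purely as an illustration (it is not what any particular bank or social network runs). Each replica accepts updates on its own and they reconcile later; the class and replica names are hypothetical.

```python
class GCounter:
    """Grow-only counter: every replica tracks its own increments and merges by taking maxima."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Taking the per-replica maximum makes merging safe to repeat and order-independent.
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

east = GCounter("east")
west = GCounter("west")

east.increment(3)   # three updates land on one replica
west.increment(2)   # two land on the other, with no locking between them
before = (east.value(), west.value())   # the replicas temporarily disagree

east.merge(west)    # later, the replicas exchange state
west.merge(east)
after = (east.value(), west.value())    # and now they agree

print(before, after)
```

Between the writes and the merge, the two replicas give different answers, just as the bank page may lag behind the real balance. After the exchange they converge on the same total.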

So, we have analytical databases getting real-time fast and operational databases increasingly able to do things concurrently without affecting production systems. Which raises the question: why do we have two databases?

This is a massive, controversial issue worth billions of dollars. Take, for example, EMC, which recently merged its Greenplum acquisition into Pivotal. Pivotal’s marketing (“help customers build, deploy, scale, and analyze at an unprecedented velocity”) points at this convergence, which may happen as organizations move their applications into cloud environments (which is partly why Pivotal includes Cloud Foundry, which VMWare acquired).

The change will probably create some huge industry consolidation in the coming years (think Oracle buying Teradata, then selling a unified operational/analytical database). There are plenty of reasons it’s a bad idea, and plenty of reasons why it’s a good one. We think this will be a hot topic in 2014.

Cassandra and the other stacks

Big data has been synonymous with Hadoop. The break-out success of the Hadoop ecosystem has been astonishing, but it does other stacks a disservice. There are plenty of other robust data architectures that have furiously enthusiastic tribes behind them. Cassandra, for example, was created by Facebook, released into the wild, and tamed by Reddit to allow the site to scale to millions of daily visitors atop Amazon with only a handful of employees. MongoDB is another great example, and there are dozens more.

Some of these stacks got wrapped around the axle of the NoSQL debate, which, as I mentioned, might have been better framed as NoJoin. But we’re past that now, and there are strong case studies for many of the stacks. There are also proven affinities between a particular stack (such as Cassandra) and a particular cloud (such as Amazon Web Services) because of their various heritages.

In 2014, we’ll be discussing more abstract topics and regarding every one of these stacks as a tool in a good toolbox.

Mobile data

By next year, there will be more mobile phones in the world than there are humans, over one billion of them “smart.” They are the closest thing we have to a tag for people. Whether it is used to measure shopper traffic in malls or to trace the source of malaria outbreaks in Africa, mobile data is big. One carrier recently released mobile data from the Ivory Coast to researchers.

Just as time series data has structure, so does geographic data, much of which lives in Strata’s Connected World track. Mobile data is a precursor to the Internet of Everything, and it’s certainly one of the most prolific structured data sources in the world.

I think concentrating on mobility is critical for another reason, too. The large systems created to handle traffic for the nearly 1,000 carriers in the world are big, fast, and rock solid. An AT&T 5ESS switch, or one of the large-scale Operational Support Systems, simply does not fall over.

Other than DNS, the Internet doesn’t really have this kind of industrial-grade system for managing billions of devices, each of which can connect to the others with just a single address. That is astonishing scale, and we tend to ignore it as “plumbing.” In 2014, the control systems for the Internet of Everything are as likely to come from Big Iron made by Ericsson as they are to come from some Web 2.0 titan.

The analytic life-cycle

The book The Theory That Would Not Die begins with a quote from John Maynard Keynes: “When the facts change, I change my opinion. What do you do, sir?” As this New York Times review of the book observes, “If you are not thinking like a Bayesian, perhaps you should be.”

Bayes’ theorem says that beliefs must be updated based on new evidence — and in an information-saturated world, new evidence arrives constantly, which means the cycle turns quickly. To many readers, this is nothing more than explaining the scientific method. But there are plenty of people who weren’t weaned on experimentation and continuous learning — and even those with a background in science make dumb mistakes, as the Boy Or Girl Paradox handily demonstrates.
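
For readers who want to see how easily intuition slips here, the following is a minimal sketch of the Boy Or Girl Paradox. It assumes the textbook setup (two children, each equally likely to be a boy or a girl, independently of the other), and the function and variable names are illustrative only. It shows how the answer to "are both children boys?" depends on exactly which piece of evidence arrived.

```python
from fractions import Fraction
from itertools import product

# Assumption: every two-child family (older, younger) is equally likely.
FAMILIES = list(product("BG", repeat=2))


def conditional(event, evidence):
    """P(event | evidence), computed by counting equally likely families."""
    consistent = [f for f in FAMILIES if evidence(f)]
    matching = [f for f in consistent if event(f)]
    return Fraction(len(matching), len(consistent))


def both_boys(family):
    return family == ("B", "B")


# Evidence A: "at least one of the children is a boy."
p_given_at_least_one_boy = conditional(both_boys, lambda f: "B" in f)

# Evidence B: "the older child is a boy."
p_given_older_is_boy = conditional(both_boys, lambda f: f[0] == "B")

print(p_given_at_least_one_boy)  # 1/3
print(p_given_older_is_boy)      # 1/2
```

The two statements sound nearly identical, yet they rule out different families, so the updated belief differs: one third in the first case, one half in the second.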

Ben Lorica, O’Reilly’s chief scientist (and owner of the enviable Twitter handle @BigData), recently wrote about the lifecycle of data analysis. I wrote another piece on the Lean Analytics cycle with Avinash Kaushik a few months ago. In both cases, it’s an iterative process of hypothesis-forming, experimentation, data collection, and readjustment.

In 2014, we’ll be spending more time looking at the whole cycle of data analysis, including collection, storage, interpretation, and the practice of asking good questions informed by new evidence.

Data anthropology

Data seldom tells the whole story. After flooding in Haiti, mobile phone data suggested people weren’t leaving one affected area for a safe haven. Researchers concluded that they were all sick with cholera and couldn’t move. But by interviewing people on the ground, aid workers found out the real problem was that flooding had destroyed the roads, making it hard to leave.

As this example shows, there’s no substitute for context. In Lean Analytics, we say “Instincts are experiments. Data is proof.” For some reason this resonated hugely and is one of the most favorited/highlighted passages in the book. People want a blend of human and machine, of soft, squishy qualitative data alongside cold, hard quantitative data. We joke that in the early stages of a startup, your only metric should be “how many people have I spoken with?” It’s too early to start counting.

In Ash Maurya’s Running Lean, there’s a lot said about customer development. Learning how to conduct good interviews that don’t lead the witness and measuring the cultural factors that can pollute data are hugely difficult. In The Righteous Mind, Jonathan Haidt says all university research is WEIRD: Western, Educated, Industrialized, Rich, and Democratic. That’s because test subjects are most often undergraduates, who fit this bill. To prove his assertion, Haidt replicated studies done on campus at a McDonald’s a few miles away, with vastly different results.

At the first Strata New York, I actually left the main room one morning to go write a blog post. I was so overcome by the examples of data errors — from bad collection, to bad analysis, to wilfully ignoring the results of good data — that it seemed to me we weren’t paying attention to the right things. If “Data is the new Oil,” then its supply chain is a controversial XL pipeline with woefully few people looking for leaks and faults. Anthropology can fix this, tying quantitative assumptions to verification.

Nobody has championed data anthropology as much as O’Reilly’s own Roger Magoulas, who joined Jon Bruner and Jim Stogdill for a podcast on the subject recently.

So, data anthropology can ensure good data collection, provide essential context to data, and check that the resulting knowledge is producing the intended results. That’s why it’s on our list of hot topics for 2014.

Photo: Book Spiral – Seattle Central Library by brewbooks, on Flickr

July 17 2013

Four short links: 17 July 2013

  1. Hideout — augmented reality books. (via Hacker News)
  2. Patterns and Practices for Open Source Software Success (Stephen Walli) — Successful FOSS projects grow their communities outward to drive contribution to the core project. To build that community, a project needs to develop three onramps for software users, developers, and contributors, and ultimately commercial contributors.
  3. How to Act on LKML — Linus’s tantrums are called out by one of the kernel developers in a clear and positive way.
  4. Beyond the Coming Age of Networked Matter (BoingBoing) — Bruce Sterling’s speculative short story, written for the Institute For The Future. “Stephen Wolfram was right about everything. Wolfram is the greatest physicist since Isaac Newton. Since Plato, even. Our meager, blind physics is just a subset of Wolfram’s new-kind-of-science metaphysics. He deserves fifty Nobels.” “How many people have read that Wolfram book?” I asked him. “I hear that his book is, like, huge, cranky, occult, and it drives readers mad.” “I read the forbidden book,” said Crawferd.

June 28 2013

Four short links: 28 June 2013

  1. Huxley vs Orwell — buy Amusing Ourselves to Death if this rings true. The future is here, it’s just not evenly surveilled. (via rone)
  2. KeyMe — keys in the cloud. (Digital designs as backups for physical objects)
  3. Motorola Advanced Technology and Products Group — The philosophy behind Motorola ATAP is to create an organization with the same level of appetite for technology advancement as DARPA, but with a consumer focus. It is a pretty interesting place to be. And they hired the excellent Johnny Chung Lee.
  4. Internet Credit Union — Internet Archive starts a Credit Union. Can’t wait to see memes on debit cards.

June 06 2013

Four short links: 6 June 2013

  1. ShareFest — peer-to-peer file sharing in the browser. Source on GitHub. (via Andy Baio)
  2. Media for Thinking the Unthinkable (Bret Victor) — “Right now, today, we can’t see the thing, at all, that’s going to be the most important 100 years from now.” We cannot see the thing. At all. But whatever that thing is — people will have to think it. And we can, right now, today, prepare powerful ways of thinking for these people. We can build the tools that make it possible to think that thing. (via Matt Jones)
  3. McKinsey Report on Disruptive Technologies (McKinsey) — the list: Mobile Internet; Automation of knowledge work; Internet of Things; Cloud technology; Advanced Robotics; Autonomous and near-autonomous vehicles; Next-generation genomics; Energy storage; 3D Printing; Advanced Materials; Advanced Oil and Gas exploration and recovery; Renewable energy.
  4. The Only Public Transcript of the Bradley Manning Trial Will be Produced on a Crowd-Funded Typewriter — [t]he fact that a volunteer stenographer is providing the only comprehensive source of information about such a monumental event is pretty absurd.

May 25 2013

Ep. 300: What We’ve Learned in Almost 7 Years

We created Astronomy Cast to be timeless, a listening experience that’s as educational in the future as it was when we started recording. But obviously, things have changed in almost 7 years and 300 episodes. Today we’ll give you an update on some of the big topics in space and astronomy. What did we know back then, and what additional stuff do we know now?

Show Notes

April 19 2013

Four short links: 19 April 2013

  1. Bruce Sterling on Disruption — If more computation, and more networking, was going to make the world prosperous, we’d be living in a prosperous world. And we’re not. Obviously we’re living in a Depression. Slow first 25% but then it takes fire and burns with the heat of a thousand Sun Microsystems flaming out. You must read this now.
  2. The Matasano Crypto Challenges (Maciej Ceglowski) — To my delight, though, I was able to get through the entire sequence. It took diligence, coffee, and a lot of graph paper, but the problems were tractable. And having completed them, I’ve become convinced that anyone whose job it is to run a production website should try them, particularly if you have no experience with application security. Since the challenges aren’t really documented anywhere, I wanted to describe what they’re like in the hopes of persuading busy people to take the plunge.
  3. Tachyon — a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. Berkeley-licensed open source.
  4. Jammit (GitHub) — an industrial strength asset packaging library for Rails, providing both the CSS and JavaScript concatenation and compression that you’d expect, as well as YUI Compressor, Closure Compiler, and UglifyJS compatibility, ahead-of-time gzipping, built-in JavaScript template support, and optional Data-URI / MHTML image and font embedding. (via Joseph Misiti)

January 30 2013

January 22 2013

Four short links: 22 January 2013

  1. Design Like Nobody’s Patenting Anything (Wired) — profile of Maker favourites Sparkfun. Instead of relying on patents for protection, the team prefers to outrace other entrants in the field. “The open source model just forces us to innovate,” says Boudreaux. “When we release something, we’ve got to be thinking about the next rev. We’re doing engineering and innovating and it’s what we wanna be doing and what we do well.”
  2. Agree to Agree — why I respect my friend David Wheeler: his Design Scene app, which features daily design inspiration, obtains prior written permission to feature the sites because doing so is not only making things legally crystal clear, but also makes his intentions clear to the sites he’s linking to. He’s shared the simple license they request.
  3. The Coming Fight Between Druids and Engineers (The Edge) — We live in a time when the loneliest place in any debate is the middle, and the argument over technology’s role in our future is no exception. The relentless onslaught of novelties technological and otherwise is tilting individuals and institutions alike towards becoming Engineers or Druids. It is a pressure we must resist, for to be either a Druid or an Engineer is to be a fool. Druids can’t revive the past, and Engineers cannot build technologies that do not carry hidden trouble. (via Beta Knowledge)
  4. Reimagining Math Textbooks (Dan Meyer) — love this outline of how a textbook could meaningfully interact with students, rather than being recorded lectures or PDF versions of cyclostyled notes and multichoice tests. Rather than using a generic example to illustrate a mathematical concept, we use the example you created. We talk about its perimeter. We talk about its area. The diagrams in the margins change. The text in the textbook changes. Check it out — they actually built it!

January 04 2013

January 01 2013

Four short links: 1 January 2013

  1. Robots Will Take Our Jobs (Wired) — I agree with Kevin Kelly that (in my words) software and hardware are eating wetware, but disagree that “This is not a race against the machines. If we race against them, we lose. This is a race with the machines. You’ll be paid in the future based on how well you work with robots. Ninety percent of your coworkers will be unseen machines. Most of what you do will not be possible without them. And there will be a blurry line between what you do and what they do. You might no longer think of it as a job, at least at first, because anything that seems like drudgery will be done by robots.” Civilizations which depend on specialization reward work and penalize idleness. We already have more people than work for them, and if we’re not to be creating a vast disconnected former workforce then we (society) need to get a hell of a lot better at creating jobs and not destroying them.
  2. Why Workers are Losing the War Against Machines (The Atlantic) — There is no economic law that says that everyone, or even most people, automatically benefit from technological progress.
  3. Early Quora Design Notes — I love reading post-mortems and learning from what other people did. Picking a starting point is important because it will be the axis the rest of the design revolves around — but it’s tricky and not always the first page in the flow. Ideally, you should start with the page that serves the most significant goals of the product.
  4. Free Data Science Books — I don’t mean free as in some guy paid for a PDF version of an O’Reilly book and then posted it online for others to use/steal, but I mean genuine published books with a free online version sanctioned by the publisher. That is, “the publisher has graciously agreed to allow a full, free version of my book to be available on this site.” (via Stein Debrouwere)

December 31 2012

Four short links: 31 December 2012

  1. Wireless Substitution (BoingBoing, CDC) — very nice graph showing the decline in landlines/growth in wireless.
  2. Maker’s Row — Our mission is to make the manufacturing process simple to understand and easy to access. From large corporations to first time designers, we are providing unparalleled access to industry-specific factories and suppliers across the United States.
  3. mySight (GitHub) — myspectral.com Spectruino analyzer for light spectra in UV/VIS/NIR.
  4. State of the World (Bruce Sterling, Jon Lebkowsky) — always a delight. Come 2013, I think it’s time for people in and around the “music industry” to stop blaming themselves, and thinking their situation is somehow special. Whatever happens to musicians will eventually happen to everybody. Nobody was or is really much better at “digital transition” than musicians were and are. If you’re superb at digitalization, that’s no great solution either. You just have to auto-disrupt and re-invent yourself over and over and over again.

December 24 2012

Four short links: 25 December 2012

  1. RebelMouse — aggregates FB, Twitter, Instagram, G+ content w/Pinboard-like aesthetics. It’s like aggregators we’ve had since 2004, but in this Brave New World we have to authenticate to a blogging service to get our own public posts out in a machine-readable form. 2012: it’s like 2000 but now we have FOUR AOLs! We’ve traded paywalls for graywalls, but the walls are still there. (via Poynter)
  2. Data Visualization Course Wiki — wiki for Stanford course cs448b, covering visualization with examples and critiques.
  3. Peristaltic Pump — for your Arduino medical projects, a pump that doesn’t touch the liquid it moves so the liquid can stay sterile.
  4. Breeze — MIT-licensed Javascript framework for building rich web apps.
