
June 14 2012

Stories over spreadsheets

I didn't realize how much I dislike spreadsheets until I was presented with a vision of the future where their dominance isn't guaranteed.

That eye-opening was offered by Narrative Science CTO Kris Hammond (@whisperspace) during a recent interview. Hammond's company turns data into stories: They provide sentences and paragraphs instead of rows and columns. To date, much of the attention Narrative Science has received has focused on the media applications. That's a natural starting point. Heck, I asked him about those very same things when I first met Hammond at Strata in New York last fall. But during our most recent chat, Hammond explored the other applications of narrative-driven data analysis.

"Companies, God bless them, had a great insight: They wanted to make decisions based upon the data that's out there and the evidence in front of them," Hammond said. "So they started gathering that data up. It quickly exploded. And they ended up with huge data repositories they had to manage. A lot of their effort ended up being focused on gathering that data, managing that data, doing analytics across that data, and then the question was: What do we do with it?"

Hammond sees an opportunity to extract and communicate the insights locked within company data. "We'll be the bridge between the data you have, the insights that are in there, or insights we can gather, and communicating that information to your clients, to your management, and to your different product teams. We'll turn it into something that's intelligible instead of a list of numbers, a spreadsheet, or a graph or two. You get a real narrative; a real story in that data."

My takeaway: The journalism applications of this are intriguing, but these other use cases are empowering.

Why? Because most people don't speak fluent "spreadsheet." They see all those neat rows and columns and charts, and they know something important is tucked in there, but what that something is and how to extract it aren't immediately clear. Spreadsheets require effort. That's doubly true if you don't know what you're looking for. And if data analysis is an adjacent part of a person's job, more effort means those spreadsheets will always be pushed to the side. "I'll get to those next week when I've got more time ..."

We all know how that plays out.

But what if the spreadsheet wasn't our default output anymore? What if we could take things most of us are hard-wired to understand — stories, sentences, clear guidance — and layer it over all that vital data? Hammond touched on that:

"For some people, a spreadsheet is a great device. For most people, not so much so. The story. The paragraph. The report. The prediction. The advisory. Those are much more powerful objects in our world, and they're what we're used to."

He's right. Spreadsheets push us (well, most of us) into a cognitive corner. Open a spreadsheet and you're forced to recalibrate your focus to see the data. Then you have to work even harder to extract meaning. This is the best we can do?

With that in mind, I asked Hammond if the spreadsheet's days are numbered.

"There will always be someone who uses a spreadsheet," Hammond said. "But, I think what we're finding is that the story is really going to be the endpoint. If you think about it, the spreadsheet is for somebody who really embraces the data. And usually what that person does is they reduce that data down to something that they're going to use to communicate with someone else."

A thought on dashboards

I used to view dashboards as the logical step beyond raw data and spreadsheets. I'm not so sure about that anymore, at least in terms of broad adoption. Dashboards are good tools, and I anticipate we'll have them from now until the end of time, but they're still weighed down by a complexity that makes them inaccessible.

It's not that people can't master the buttons and custom reports in dashboards; they simply don't have time. These people — and I include myself among them — need something faster and knob-free. Simplicity is the thing that will ultimately democratize data reporting and data insights. That's why the expansion of data analysis requires a refinement beyond our current dashboards. There's a next step that hasn't been addressed.

Does the answer lie in narrative? Will visualizations lead the way? Will a hybrid format take root? I don't know what the final outputs will look like, but the importance of data reporting means someone will eventually crack the problem.

Full interview

You can see the entire discussion with Hammond in the following video.


April 09 2012

Operations, machine learning and premature babies

Julie Steele and I recently had lunch with Etsy's John Allspaw and Kellan Elliott-McCrea. I'm not sure how we got there, but we made a connection between web operations and medical care for premature infants that was (to me) astonishing.

I've written several times about IBM's work in neonatal intensive care at the University of Toronto. In any neonatal intensive care unit (NICU), every baby is connected to dozens of monitors. And each monitor is streaming hundreds of readings per second into various data systems. They can generate alerts if anything goes severely out of spec, but in normal operation, they just generate a summary report for the doctor every half hour or so.

IBM discovered that by applying machine learning to the full data stream, they were able to diagnose some dangerous infections a full day before any symptoms were noticeable to a human. That's amazing in itself, but what's more important is what they were looking for. I expected them to be looking for telltale spikes or irregularities in the readings: perhaps not serious enough to generate an alarm on their own, but still, the sort of things you'd intuitively expect of a person about to become ill. But according to Anjul Bhambhri, IBM's Vice President of Big Data, the telltale signal wasn't spikes or irregularities, but the opposite. There's a certain normal variation in heart rate, etc., throughout the day, and babies who were about to become sick didn't exhibit the variation. Their heart rate was too normal; it didn't change throughout the day as much as it should.

That observation strikes me as revolutionary. It's easy to detect problems when something goes out of spec: If you have a fever, you know you're sick. But how do you detect problems that don't set off an alarm? How many diseases have early symptoms that are too subtle for a human to notice, and only accessible to a machine learning system that can sift through gigabytes of data?

In our conversation, we started wondering how this applied to web operations. We have gigabytes of data streaming off of our servers, but the state of system and network monitoring hasn't changed in years. We look for parameters that are out of spec, thresholds that are crossed. And that's good for a lot of problems: You need to know if the number of packets coming into an interface suddenly goes to zero. But what if the symptom we should look for is radically different? What if crossing a threshold isn't what indicates trouble, but the disappearance (or diminution) of some regular pattern? Is it possible that our computing infrastructure also exhibits symptoms that are too subtle for a human to notice but would easily be detectable via machine learning?
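
A minimal sketch of that idea: a "too quiet" alarm that fires when a normally noisy metric loses its variation, rather than when it crosses a threshold. The window size and floor ratio here are arbitrary illustrations, not values from the IBM work.

```python
from collections import deque
import statistics

def variability_alarm(readings, window=60, baseline_std=None, floor_ratio=0.25):
    """Flag points where a metric's variability collapses below a
    fraction of its baseline. Unlike a threshold alarm, this fires
    when a normally noisy signal becomes *too* quiet -- the pattern
    IBM found in the NICU data."""
    buf = deque(maxlen=window)
    alarms = []
    for i, x in enumerate(readings):
        buf.append(x)
        if len(buf) < window:
            continue
        std = statistics.pstdev(buf)
        if baseline_std is None:
            baseline_std = std  # first full window sets the baseline
            continue
        if std < floor_ratio * baseline_std:
            alarms.append(i)
    return alarms
```

Feed it a stream that oscillates normally and then flatlines, and the alarms cluster over the flat stretch, even though no individual reading ever goes out of spec.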

We talked a bit about whether it was possible to alarm on the first (and second) derivatives of some key parameters, and of course it is. Doing so would require more sophistication than our current monitoring systems have, but it's not too hard to imagine. But it also misses the point. Once you know what to look for, it's relatively easy to figure out how to detect it. IBM's insight wasn't detecting the patterns that indicated a baby was about to become sick, but using machine learning to figure out what the patterns were. Can we do the same? It's not inconceivable, though it wouldn't be easy.
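
Alarming on first and second derivatives needs little more than a numerical difference and a threshold; a minimal sketch (the limits and data are illustrative):

```python
import numpy as np

def derivative_alarms(series, d1_limit, d2_limit):
    """Return indices where the rate of change (or the rate of change
    of the rate of change) exceeds a limit, even though the raw value
    never crosses any threshold."""
    d1 = np.diff(series, n=1)  # first derivative (per sample)
    d2 = np.diff(series, n=2)  # second derivative
    a1 = np.where(np.abs(d1) > d1_limit)[0] + 1  # index of the sample after the jump
    a2 = np.where(np.abs(d2) > d2_limit)[0] + 2
    return sorted(set(a1.tolist()) | set(a2.tolist()))
```

As the post says, this is the easy part: once you know the pattern to look for, detecting it is straightforward. The hard part is the machine learning that discovers the pattern in the first place.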

Web operations has been on the forefront of "big data" since the beginning. Long before we were talking about sentiment analysis or recommendation engines, webmasters and system administrators were analyzing problems by looking through gigabytes of server and system logs, using tools that were primitive or non-existent. MRTG and HP's OpenView were savage attempts to put together information dashboards for IT groups. But at most enterprises, operations hasn't taken the next step. Operations staff doesn't have the resources (neither computational nor human) to apply machine intelligence to our problems. We'd have to capture all the data coming off our servers for extended periods, not just the server logs that we capture now, but every kind of data we can collect: network data, environmental data, I/O subsystem data, you name it. At a recent meetup about finance, Abhi Mehta encouraged people to capture and save "everything." He was talking about financial data, but the same applies here. We'd need to build Hadoop clusters to monitor our server farms; we'd need Hadoop clusters to monitor our Hadoop clusters. It's a big investment of time and resources. If we could make that investment, what would we find out? I bet that we'd be surprised.

Velocity 2012: Web Operations & Performance — The smartest minds in web operations and performance are coming together for the Velocity Conference, being held June 25-27 in Santa Clara, Calif.

Save 20% on registration with the code RADAR20



April 05 2012

Data as seeds of content

Despite the attention big data has received in the media and among the technology community, it is surprising that we are still shortchanging the full capabilities of what data can do for us. At times, we get caught up in the excitement of the technical challenge of processing big data and lose sight of the ultimate goal: to derive meaningful insights that can help us make informed decisions and take action to improve our businesses and our lives.

I recently spoke on the topic of automating content at the O'Reilly Strata Conference. It was interesting to see the various ways companies are attempting to make sense out of big data. Currently, the lion's share of the attention is focused on ways to analyze and crunch data, but very little has been done to help communicate results of big data analysis. Data can be a very valuable asset if properly exploited. As I'll describe, there are many interesting applications one can create with big data that can describe insights or even become monetizable products.

To date, the de facto format for representing big data has been visualizations. While visualizations are great for compacting a large amount of data into something that can be interpreted and understood, the problem is just that — visualizations still require interpretation. There were many sessions at Strata about how to create effective visualizations, but the reality is the quality of visualizations in the real world varies dramatically. Even for the visualizations that do make intuitive sense, they often require some expertise and knowledge of the underlying data. That means a large number of people who would be interested in the analysis won't be able to gain anything useful from it because they don't know how to interpret the information.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

To be clear, I'm a big fan of visualizations, but they are not the end-all in data analysis. They should be considered just one tool in the big data toolbox. I think of data as the seeds for content, whereby data can ultimately be represented in a number of different formats depending on your requirements and target audiences. In essence, data are the seeds that can sprout as large a content tree as your imagination will allow.

Below, I describe each limb of the content tree. The examples I cite are sports related because that's what we've primarily focused on at my company, Automated Insights. But we've done very similar things in other content areas rich in big data, such as finance, real estate, traffic and several others. In each case, once we completed our analysis and targeted the type of content we wanted to create, we completely automated the future creation of the content.

Long-form content

By long-form, I mean three or more paragraphs — although it could be several pages or even book length — that use human-readable language to reveal key trends, records and deltas in data. This is the hardest form of content to automate, but technology in this space is rapidly improving. For example, here is a recap of an NFL game generated out of box score and play-by-play data.

A long-form sports recap driven by data. See the full story.

Short-form content

These are bullets, headlines, and tweets of insights that can boil a huge dataset into very actionable bits of language. For example, here is a game notes article that was created automatically out of an NCAA basketball box score and historical stats.

Mobile and social content

We've done a lot of work creating content for mobile applications and various social networks. Last year, we auto-generated more than a half-million tweets. For example, here is the automated Twitter stream we maintain that covers UNC Basketball.


Metrics

By metrics, I'm referring to the process of creating a single number that's representative of a larger dataset. Metrics are shortcuts to boil data into something easier to understand. For instance, we've created metrics for various sports, such as a quarterback ranking system that's based on player performance.
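
One common way to build such a metric is a weighted sum of z-scores across a player's stats; the stats and weights below are hypothetical illustrations, not Automated Insights' actual quarterback formula.

```python
import statistics

def composite_metric(player, league, weights=None):
    """Reduce several stats to a single number via weighted z-scores.

    `player` and each entry of `league` are dicts of raw stats.
    The negative weight on interceptions penalizes them.
    """
    weights = weights or {"yards": 1.0, "tds": 1.5, "ints": -2.0}
    score = 0.0
    for stat, w in weights.items():
        values = [p[stat] for p in league]
        mean = statistics.mean(values)
        sd = statistics.pstdev(values) or 1.0  # guard against zero spread
        score += w * (player[stat] - mean) / sd
    return score
```

The z-scoring keeps stats on different scales (yards vs. touchdowns) comparable before they're combined into one number.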

Real-time updates

Instead of thinking of data as something you crunch and analyze days or weeks after it was created, there are opportunities to turn big data into real-time information that provides interested users with updates as soon as they occur. We have a real-time NCAA basketball scoreboard that updates with new scores.

Content applications

This is one few people consider, but creating content-based applications is a great way to make use of and monetize data. For example, we created StatSmack, which is an app that allows sports fans to discover 10-20+ statistically based "slams" that enable them to talk trash about any team.

A variation on visualizations

Used in the right context, visualizations can be an invaluable tool for understanding a large dataset. The secret is combining bulleted text-based insights with the graphical visualization to allow them to work together to truly inform the user. For example, this page has a chart of win probability over the course of game seven of the 2011 World Series. It shows the ebb and flow of the game.

Play-by-play win probability from game seven of the 2011 World Series.

What now?

As more people get their heads around how to crunch and analyze data, the issue of how to effectively communicate insights from that data will be a bigger concern. We are still in the very early stages of this capability, so expect a lot of innovation over the next few years related to automating the conversion of data to content.


March 30 2012

Automated science, deep data and the paradox of information

A lot of great pieces have been written about the relatively recent surge in interest in big data and data science, but in this piece I want to address the importance of deep data analysis: what we can learn from the statistical outliers by drilling down and asking, "What's different here? What's special about these outliers and what do they tell us about our models and assumptions?"

The reason that big data proponents are so excited about the burgeoning data revolution isn't just because of the math. Don't get me wrong, the math is fun, but we're excited because we can begin to distill patterns that were previously invisible to us due to a lack of information.

That's big data.

Of course, data are just a collection of facts; bits of information that are only given context — assigned meaning and importance — by human minds. It's not until we do something with the data that any of it matters. You can have the best machine learning algorithms, the tightest statistics, and the smartest people working on them, but none of that means anything until someone makes a story out of the results.

And therein lies the rub.

Do all these data tell us a story about ourselves and the universe in which we live, or are we simply hallucinating patterns that we want to see?

(Semi)Automated science

In 2010, Cornell researchers Michael Schmidt and Hod Lipson published a groundbreaking paper in "Science" titled, "Distilling Free-Form Natural Laws from Experimental Data". The premise was simple, and it essentially boiled down to the question, "can we algorithmically extract models to fit our data?"

So they hooked up a double pendulum — a seemingly chaotic system whose movements are governed by classical mechanics — and trained a machine learning algorithm on the motion data.

Their results were astounding.

In a matter of minutes the algorithm converged on Newton's second law of motion: f = ma. What took humanity tens of thousands of years to accomplish was completed on 32 cores in essentially no time at all.
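
Schmidt and Lipson's system evolved free-form expressions with genetic programming, but the flavor of the search can be conveyed with a toy sketch: score a few candidate "laws" against observed (mass, acceleration, force) triples and keep the best fit. The candidate list and data here are entirely my own illustration, not their algorithm.

```python
# Candidate "laws" relating mass and acceleration to force.
candidates = {
    "f = m + a": lambda m, a: m + a,
    "f = m * a": lambda m, a: m * a,
    "f = m * a**2": lambda m, a: m * a ** 2,
}

def best_law(data):
    """Pick the candidate with the lowest sum of squared errors."""
    def sse(fn):
        return sum((fn(m, a) - f) ** 2 for m, a, f in data)
    return min(candidates, key=lambda name: sse(candidates[name]))

# Synthetic observations consistent with Newton's second law.
observations = [(m, a, m * a) for m in (1.0, 2.5, 4.0) for a in (0.5, 3.0, 9.8)]
```

Run on these observations, the search lands on f = m * a. The real work, of course, is searching an open-ended expression space rather than a hand-picked list.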

In 2011, some neuroscience colleagues of mine, led by Tal Yarkoni, published a paper in "Nature Methods" titled "Large-scale automated synthesis of human functional neuroimaging data". In this paper the authors sought to extract patterns from the overwhelming flood of brain imaging research.

To do this they algorithmically extracted the 3D coordinates of significant brain activations from thousands of neuroimaging studies, along with words that frequently appeared in each study. Using these two pieces of data along with some simple (but clever) mathematical tools, they were able to create probabilistic maps of brain activation for any given term.

In other words, you type in a word such as "learning" on their website search and visualization tool, NeuroSynth, and they give you back a pattern of brain activity that you should expect to see during a learning task.

But that's not all. Given a pattern of brain activation, the system can perform a reverse inference, asking, "given the data that I'm observing, what is the most probable behavioral state that this brain is in?"

Similarly, in late 2010, my wife (Jessica Voytek) and I undertook a project to algorithmically discover associations between concepts in the peer-reviewed neuroscience literature. As a neuroscientist, the goal of my research is to understand relationships between the human brain, behavior, physiology, and disease. Unfortunately, the facts that tie all that information together are locked away in more than 21 million static peer-reviewed scientific publications.

How many undergrads would I need to hire to read through that many papers? Any volunteers?

Even more mind-boggling, each year more than 30,000 neuroscientists attend the annual Society for Neuroscience conference. If we assume that only two-thirds of those people actually do research, and if we assume that they only work a meager (for the sciences) 40 hours a week, that's around 40 million person-hours dedicated to but one branch of the sciences.


This means that in the 10 years I've been attending that conference, more than 400 million person-hours have gone toward the pursuit of understanding the brain. Humanity built the pyramids in 30 years. The Apollo Project got us to the moon in about eight.
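
The back-of-envelope arithmetic above can be checked directly (assuming a roughly 50-week working year, which the original estimate implies):

```python
# Sanity check of the person-hours estimate.
attendees = 30_000
researchers = attendees * 2 // 3   # assume two-thirds actually do research
hours_per_week = 40
weeks_per_year = 50                # roughly a working year

hours_per_year = researchers * hours_per_week * weeks_per_year
print(hours_per_year)              # "around 40 million person-hours"
print(hours_per_year * 10)         # over 10 years of conferences
```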

So my wife and I said to ourselves, "there has to be a better way".

Which led us to create brainSCANr, a simple (simplistic?) tool (currently itself under peer review) that makes the assumption that the more often that two concepts appear together in the titles or abstracts of published papers, the more likely they are to be associated with one another.

For example, if 10,000 papers mention "Alzheimer's disease" that also mention "dementia," then Alzheimer's disease is probably related to dementia. In fact, there are 17,087 papers that mention Alzheimer's and dementia, whereas there are only 14 papers that mention Alzheimer's and, for example, creativity.
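
The underlying measure is just co-occurrence counting. A toy sketch over a three-paper "corpus" (brainSCANr mined the titles and abstracts of millions of real papers, so treat this as illustration only):

```python
def cooccurrence_counts(abstracts, term_a, term_b):
    """Count papers mentioning term_a, term_b, and both (case-insensitive)."""
    a = b = both = 0
    for text in abstracts:
        t = text.lower()
        has_a, has_b = term_a.lower() in t, term_b.lower() in t
        a += has_a
        b += has_b
        both += has_a and has_b
    return a, b, both

papers = [
    "Alzheimer's disease is a progressive dementia.",
    "Dementia care and Alzheimer's progression.",
    "Creativity and divergent thinking in adults.",
]
```

At scale, the ratio of shared mentions to individual mentions (17,087 for Alzheimer's and dementia versus 14 for Alzheimer's and creativity) is what drives the association strength.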

From this, we built what we're calling the "cognome", a mapping between brain structure, function, and disease.

Big data, data mining, and machine learning are becoming critical tools in the modern scientific arsenal. Examples abound: text mining recipes to find cultural food taste preferences, analyzing cultural trends via word use in books ("culturomics"), identifying seasonality of mood from tweets, and so on.

But so what?

Deep data

What those three studies show us is that it's possible to automate, or at least semi-automate, critical aspects of the scientific method itself. Schmidt and Lipson show that it is possible to extract equations that perfectly model even seemingly chaotic systems. Yarkoni and colleagues show that it is possible to infer a complex behavioral state given input brain data.

My wife and I wanted to show that brainSCANr could be put to work for something more useful than just quantifying relationships between terms. So we created a simple algorithm to perform what we're calling "semi-automated hypothesis generation," which is predicated on a basic "the friend of a friend should be a friend" concept.

In the example below, the neurotransmitter "serotonin" has thousands of shared publications with "migraine," as well as with the brain region "striatum." However, migraine and striatum only share 16 publications.

That's very odd. Because in medicine there is a serotonin hypothesis for the root cause of migraines. And we (neuroscientists) know that serotonin is released in the striatum to modulate brain activity in that region. Given that those two things are true, why is there so little research regarding the role of the striatum in migraines?

Perhaps there's a missing connection?
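
That "friend of a friend" heuristic is simple to sketch: find pairs of terms that each co-occur heavily with a shared hub but rarely with each other. The counts below echo the serotonin example from the text; the thresholds are my own illustration, not values from the brainSCANr paper.

```python
def missing_links(cooc, hub, strong=1000, weak=50):
    """Find term pairs that each co-occur heavily with `hub` but rarely
    with each other -- candidate hypotheses worth investigating.

    `cooc` maps frozenset({term1, term2}) -> shared publication count.
    """
    partners = [t for pair, n in cooc.items() if hub in pair and n >= strong
                for t in pair if t != hub]
    gaps = []
    for i, a in enumerate(partners):
        for b in partners[i + 1:]:
            if cooc.get(frozenset({a, b}), 0) <= weak:
                gaps.append((a, b))
    return gaps

counts = {
    frozenset({"serotonin", "migraine"}): 4000,
    frozenset({"serotonin", "striatum"}): 3000,
    frozenset({"migraine", "striatum"}): 16,
}
```

Applied to these counts with "serotonin" as the hub, the migraine-striatum pair falls out as the under-explored link.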

Such missing links and other outliers in our models are the essence of deep data analytics. Sure, any data scientist worth their salt can take a mountain of data and reduce it down to a few simple plots. And such plots are important because they tell a story. But those aren't the only stories that our data can tell us.

For example, in my geoanalytics work as the data evangelist for Uber, I put some of my (definitely rudimentary) neuroscience network analytic skills to work to figure out how people move from neighborhood to neighborhood in San Francisco.

At one point, I checked to see if men and women moved around the city differently. A very simple regression model showed that the number of men who go to any given neighborhood significantly predicts the number of women who go to that same neighborhood.

No big deal.

But what was cool was seeing where the outliers were. When I looked at the model's residuals, that's where I found the far more interesting story. While it's good to have a model that fits your data, knowing where the model breaks down is not only important for internal metrics, but it also makes for a more interesting story:

What's happening in the Marina district that so many more women want to go there? And why are there so many more men in SoMa?
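
Finding those outliers takes little more than a least-squares fit and a sort on the residuals. A minimal sketch with made-up neighborhood data (nothing here is actual Uber data):

```python
import numpy as np

def biggest_residuals(men, women, names, top=2):
    """Fit women ~ a*men + b, then rank neighborhoods by |residual|.

    The fit tells the boring story; the residuals show where the
    model breaks down -- the interesting neighborhoods.
    """
    men = np.asarray(men, float)
    women = np.asarray(women, float)
    A = np.vstack([men, np.ones_like(men)]).T
    (a, b), *_ = np.linalg.lstsq(A, women, rcond=None)
    resid = women - (a * men + b)
    order = np.argsort(-np.abs(resid))
    return [(names[i], float(resid[i])) for i in order[:top]]
```

Given trip counts per neighborhood, the top entries are the places where one gender shows up far more often than the overall trend predicts.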

The paradox of information

The interpretation of big data analytics can be a messy game. Maybe there are more men in SoMa because that's where AT&T Park is. But maybe there are just five guys who live in SoMa who happen to take Uber 100 times more often than average.

While data-driven posts make for fun reading (and writing), in the sciences we need to be more careful that we don't fall prey to ad hoc, just-so stories that sound perfectly reasonable and plausible, but which we cannot conclusively prove.

In 2008, psychologists David McCabe and Alan Castel published a paper in the journal "Cognition," titled, "Seeing is believing: The effect of brain images on judgments of scientific reasoning". In that paper, they showed that summaries of cognitive neuroscience findings that are accompanied by an image of a brain scan were rated as more credible by the readers.

This should cause any data scientist serious concern. In fact, I've formulated three laws of statistical analyses:

  1. The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
  2. The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
  3. Any sufficiently advanced statistics can trick people into believing the results reflect truth.

The first law is closely related to the "bike shed effect" (also known as Parkinson's Law of Triviality) which states that, "the time spent on any item of the agenda will be in inverse proportion to the sum involved."

In other words, if you try to build a simple thing such as a public bike shed, there will be endless town hall discussions wherein people argue over trivial details such as the color of the door. But if you want to build a nuclear power plant — a project so vast and complicated that most people can't understand it — people will defer to expert opinion.

Such is the case with statistics.

If you make the mistake of going into the comments section of any news piece discussing a scientific finding, invariably someone will leave the comment, "correlation does not equal causation."

We'll go ahead and call that truism Voytek's fourth law.

But people rarely have the capacity to argue against the methods and models used by, say, neuroscientists or cosmologists.

But sometimes we get perfect models without any understanding of the underlying processes. What do we learn from that?

The always fantastic Radiolab did a follow-up story on the Schmidt and Lipson "automated science" research in an episode titled "Limits of Science". It turns out, a biologist contacted Schmidt and Lipson and gave them data to run their algorithm on. They wanted to figure out the principles governing the dynamics of a single-celled bacterium. Their result?

Well sometimes the stories we tell with data ... they just don't make sense to us.

They found, "two equations that describe the data."

But they didn't know what the equations meant. They had no context. Their variables had no meaning. Or, as Radiolab co-host Jad Abumrad put it, "the more we turn to computers with these big questions, the more they'll give us answers that we just don't understand."

So while big data projects are creating ridiculously exciting new vistas for scientific exploration and collaboration, we have to take care to avoid the Paradox of Information wherein we can know too many things without knowing what those "things" are.

Because at some point, we'll have so much data that we'll stop being able to discern the map from the territory. Our goal as (data) scientists should be to distill the essence of the data into something that tells as true a story as possible while being as simple as possible to understand. Or, to operationalize that sentence better, we should aim to find balance between minimizing the residuals of our models and maximizing our ability to make sense of those models.

Recently, Stephen Wolfram released the results of a 20-year long experiment in personal data collection, including every keystroke he's typed and every email he's sent. In response, Robert Krulwich, the other co-host of Radiolab, concludes by saying "I'm looking at your data [Dr. Wolfram], and you know what's amazing to me? How much of you is missing."

Personally, I disagree; I believe that there's a humanity in those numbers and that Mr. Krulwich is falling prey to the idea that science somehow ruins the magic of the universe. Quoth Dr. Sagan:

"It is sometimes said that scientists are unromantic, that their passion to figure out robs the world of beauty and mystery. But is it not stirring to understand how the world actually works — that white light is made of colors, that color is the way we perceive the wavelengths of light, that transparent air reflects light, that in so doing it discriminates among the waves, and that the sky is blue for the same reason that the sunset is red? It does no harm to the romance of the sunset to know a little bit about it."

So go forth and create beautiful stories, my statistical friends. See you after peer-review.

Where Conference 2012 — Bradley Voytek will examine the connection between geodata and user experience through a number of sessions at O'Reilly's Where Conference, being held April 2-4 in San Francisco.

Save 20% on registration with the code RADAR20


March 20 2012

The unreasonable necessity of subject experts

One of the highlights of the 2012 Strata California conference was the Oxford-style debate on the proposition "In data science, domain expertise is more important than machine learning skill." If you weren't there, Mike Driscoll's summary is an excellent overview (full video of the debate is available here). To make the story short, the "cons" won; the audience was won over to the side that machine learning is more important. That's not surprising, given that we've all experienced the unreasonable effectiveness of data. From the audience, Claudia Perlich pointed out that she won data mining competitions on breast cancer, movie reviews, and customer behavior without any prior knowledge. And Pete Warden (@petewarden) made the point that, when faced with the problem of finding "good" pictures on Facebook, he ran a data mining contest at Kaggle.

The "Data Science Debate" panel at Strata California 2012. Watch the debate.

A good impromptu debate necessarily raises as many questions as it answers. Here's the question that I was left with. The debate focused on whether domain expertise was necessary to ask the right questions, but a recent Guardian article,"The End of Theory," asked a different but related question: Do we need theory (read: domain expertise) to understand the results, the output of our data analysis? The debate focused on a priori questions, but maybe the real value of domain expertise is a posteriori: after-the-fact reflection on the results and whether they make sense. Asking the right question is certainly important, but so is knowing whether you've gotten the right answer and knowing what that answer means. Neither problem is trivial, and in the real world, they're often closely coupled. Often, the only way to know you've put garbage in is that you've gotten garbage out.

By the same token, data analysis frequently produces results that make too much sense. It yields data that merely reflects the biases of the organization doing the work. Bad sampling techniques, overfitting, cherry picking datasets, overly aggressive data cleaning, and other errors in data handling can all lead to results that are either too expected or unexpected. "Stupid Data Miner Tricks" is a hilarious send-up of the problems of data mining: It shows how to "predict" the value of the S&P index over a 10-year period based on butter production in Bangladesh, cheese production in the U.S., and the world sheep population.

Cherry picking and overfitting have particularly bad "smells" that are often fairly obvious: The Democrats never lose a Presidential election in a year when the Yankees win the World Series, for example. (Hmmm. The 2000 election was rather fishy.) Any reasonably experienced data scientist should be able to stay out of trouble, but what if you treat your data with care and it still spits out an unexpected result? Or an expected result that's too good to be true? After the data crunching has been done, it's the subject expert's job to ensure that your results are good, meaningful, and well-understood.

Let's say you're an audio equipment seller analyzing a lot of purchase data and you find out that people buy more orange juice just before replacing their home audio system. It's an unlikely, absurd (and completely made up) result, but stranger things have happened. I'd probably go and build an audio gear marketing campaign targeting bulk purchasers of orange juice. Sales would probably go up; data is "unreasonably effective," even if you don't know why. This is precisely where things get interesting, and precisely where I think subject matter expertise becomes important: after the fact. Data breeds data, and it's naive to think that marketing audio gear to OJ addicts wouldn't breed more datasets and more analysis. It's naive to think the OJ data wouldn't be used in combination with other datasets to produce second-, third-, and fourth-order results. That's when the unreasonable effectiveness of data isn't enough; that's when it's important to understand the results in ways that go beyond what data analysis alone can currently give us. We may have a useful result that we don't understand, but is it meaningful to combine that result with other results that we may (or may not) understand?

Let's look at a more realistic scenario. Pete Warden's Kaggle-based algorithm for finding quality pictures works well, despite giving the surprising result that pictures with "Michigan" in the caption are significantly better than average. (As are pictures from Peru, and pictures taken of tombs.) Why Michigan? Your guess is as good as mine. For Warden's application, building photo albums on the fly for his company Jetpac, that's fine. But if you're building a more complex system that plans vacations for photographers, you'd better know more than that. Why are the photographs good? Is Michigan a destination for birders? Is it a destination for people who like tombs? Is it a destination with artifacts from ancient civilizations? Or would you be better off recommending a trip to Peru?

Another realistic scenario: Target recently used purchase histories to target pregnant women with ads for baby-related products, with surprising success. I won't rehash that story. From that starting point, you can go a lot further. Pregnancies frequently lead to new car purchases. New car purchases lead to new insurance premiums, and I expect data will show that women with babies are safer drivers. At each step, you're compounding data with more data. It would certainly be nice to know you understood what was happening at each step of the way before offering a teenage driver a low insurance premium just because she thought a large black handbag (that happened to be appropriate for storing diapers) looked cool.

There's a limit to the value you can derive from correct but inexplicable results. (Whatever else one may say about the Target case, it looks like they made sure they understood the results.) It takes a subject matter expert to make the leap from correct results to understood results. In an email, Pete Warden said:

"My biggest worry is that we're making important decisions based on black-box algorithms that may have hidden and problematic biases. If we're deciding who to give a mortgage based on machine learning, and the system consistently turns down black people, how do we even notice it, let alone fix it, unless we understand what the rules are? A real-world case is trading systems. If you have a mass of tangled and inexplicable logic driving trades, how do you assign blame when something like the Flash Crash happens?

"For decades, we've had computer systems we don't understand making decisions for us, but at least when something went wrong we could go in afterward and figure out what the causes were. More and more, we're going to be left shrugging our shoulders when someone asks us for an explanation."

That's why you need subject matter experts to understand your results, rather than simply accepting them at face value. It's easy to imagine that subject matter expertise requires hiring a PhD in some arcane discipline. For many applications, though, it's much more effective to develop your own expertise. In an email exchange, DJ Patil (@dpatil) said that people often become subject experts just by playing with the data. As an undergrad, he had to analyze a dataset about sardine populations off the coast of California. Trying to understand some anomalies led him to ask questions about coastal currents, why biologists only count sardines at certain stages in their life cycle, and more. Patil said:

"... this is what makes an awesome data scientist. They use data to have a conversation. This way they learn and bring other data elements together, create tests, challenge hypothesis, and iterate."

By asking questions of the data, and using those questions to ask more questions, Patil became an expert in an esoteric branch of marine biology, and in the process greatly increased the value of his results.

When subject expertise really isn't available, it's possible to create a workaround through clever application design. One of my takeaways from Patil's "Data Jujitsu" talk was the clever way LinkedIn "crowdsourced" subject matter expertise to their membership. Rather than sending job recommendations directly to a member, they'd send them to a friend, and ask the friend to pass along any they thought appropriate. This trick doesn't solve problems with hidden biases, and it doesn't give LinkedIn insight into why any given recommendation is appropriate, but it does an effective job of filtering inappropriate recommendations.

Whether you hire subject experts, grow your own, or outsource the problem through the application, data only becomes "unreasonably effective" through the conversation that takes place after the numbers have been crunched. At his Strata keynote, Avinash Kaushik (@avinash) revisited Donald Rumsfeld's statement about known knowns, known unknowns, and unknown unknowns, and argued that the "unknown unknowns" are where the most interesting and important results lie. That's the territory we're entering here: data-driven results we would never have expected.

We can only take our inexplicable results at face value if we're just going to use them and put them away. Nobody uses data that way. To push through to the next, even more interesting result, we need to understand what our results mean; our second- and third-order results will only be useful when we understand the foundations on which they're based. And that's the real value of a subject matter expert: not just asking the right questions, but understanding the results and finding the story that the data wants to tell. Results are good, but we can't forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. And those stories are going to be ever more essential as we use data to build increasingly complex systems.

Strata Santa Clara 2012 Complete Video Compilation
The Strata video compilation includes workshops, sessions and keynotes from the 2012 Strata Conference in Santa Clara, Calif. Learn more and order here.


March 17 2012

Profile of the Data Journalist: The Homicide Watch

Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted in-person and email interviews during the 2012 NICAR Conference and published a series of data journalist profiles here at Radar.

Chris Amico (@eyeseast) is a journalist and web developer based in Washington, DC, where he works on NPR's State Impact project, building a platform for local reporters covering issues in their states. Laura Norton Amico (@LauraNorton) is the editor of Homicide Watch (@HomicideWatch), an online community news platform in Washington, D.C. that aspires to cover every homicide in the District of Columbia. And yes, the similar names aren't a coincidence: the Amicos were married in 2010.

Since Homicide Watch launched in 2009, it's been earning praise and interest from around the digital world, including a profile by the Nieman Lab at Harvard University that asked whether a local blog "could fill the gaps of DC's homicide coverage." Notably, Homicide Watch has turned up a number of unreported murders.

In the process, the site has also highlighted an important emerging set of data that other digital editors should consider: using inbound search engine analytics for reporting. As Steve Myers reported for the Poynter Institute, Homicide Watch used clues in site search queries to ID a homicide victim. We'll see if the Knight Foundation thinks this idea has legs: the husband and wife team have applied for a Knight News Challenge grant to build a toolkit for real-time investigative reporting from site analytics.

The Amicos' success with the site, which saw big growth in 2011, offers an important case study in why organizing beats may well hold similar importance as investigative projects. It will also be a case study in sustainability and business models for the "new news," as Homicide Watch looks to license its platform to news outlets across the country.

Below, I've embedded a presentation on Homicide Watch from the January 2012 meeting of the Online News Association. Our interview follows.


Where do you work now? What is a day in your life like?

Laura: I work full time right now for Homicide Watch, a database-driven beat publishing platform for covering homicides. Our flagship site is in DC, and I’m the editor and primary reporter on that site as well as running business operations for the brand.

My typical days start with reporting. First, news checks, and maybe posting some quick posts on anything that’s happened overnight. After that, it’s usually off to court to attend hearings and trials, get documents, reporting stuff. I usually have a to-do list for the day that includes business meetings, scheduling freelancers, mapping out long-term projects, doing interviews about the site, managing our accounting, dealing with awards applications, blogging about the start-up data journalism life on my personal blog and for ONA, guest teaching the occasional journalism class, and meeting deadlines for freelance stories. The work day never really ends; I’m online keeping an eye on things until I go to bed.

Chris: I work for NPR, on the State Impact project, where I build news apps and tools for journalists. With Homicide Watch, I work in short bursts, usually an hour before dinner and a few hours after. I’m a night owl, so if I let myself, I’ll work until 1 or 2 a.m., just hacking at small bugs on the site. I keep a long list of little things I can fix, so I can dip into the codebase, fix something and deploy it, then do something else. Big features, like tracking case outcomes, tend to come from weekend code sprints.

How did you get started in data journalism? Did you get any special degrees or certificates?

Laura: Homicide Watch DC was my first data project. I’ve learned everything I know now from conceiving of the site, managing it as Chris built it, and from working on it. Homicide Watch DC started as a spreadsheet. Our start-up kit for newsrooms starting Homicide Watch sites still includes filling out a spreadsheet. The best lesson I learned when I was starting out was to find out what all the pieces are and learn how to manage them in the simplest way possible.

Chris: My first job was covering local schools in southern California, and data kept creeping into my beat. I liked having firm answers to tough questions, so I made sure I knew, for example, how many graduates at a given high school met the minimum requirements for college. California just has this wealth of education data available, and when I started asking questions of the data, I got stories that were way more interesting.

I lived in Dalian, China for a while. I helped start a local news site with two other expats (Alex Bowman and Rick Martin). We put everything we knew about the city -- restaurant reviews, blog posts, photos from Flickr -- into one big database and mapped it all. It was this awakening moment when suddenly we had this resource where all the information we had was interlinked. When I came back to California, I sat down with a book on Python and Django and started teaching myself to code. I spent a year freelancing in the Bay Area, writing for newspapers by day, learning Python by night. Then the NewsHour hired me.

Did you have any mentors? Who? What were the most important resources they shared with you?

Laura: Chris really coached me through the complexities of data journalism when we were creating the site. He taught me that data questions are editorial questions. When I realized that data could be discussed as an editorial approach, it opened the crime beat up. I learned to ask questions of the information I was gathering in a new way.

Chris: My education has been really informal. I worked with a great reporter at my first job, Bob Wilson, who is a great interviewer of both people and spreadsheets. At NewsHour, I worked with Dante Chinni on Patchwork Nation; he taught me about reporting around a central organizing principle. Since I’ve started coding, I’ve ended up in this great little community of programmer-journalists where people bounce ideas around and help each other out.

What does your personal data journalism "stack" look like? What tools could you not live without?

Laura: The site itself and its database, which I report to and from, WordPress, WordPress analytics, Google Analytics, Google Calendar, Twitter, Facebook, Storify, DocumentCloud, VINElink, and DC Superior Court’s online case lookup.

Chris: Since I write more Python than prose these days, I spend most of my time in a text editor (usually TextMate) on a MacBook Pro. I try not to do anything without git.

What data journalism project are you the most proud of working on or creating?

Laura: Homicide Watch is the best thing I’ve ever done. It’s not just about the data, and it’s not just about the journalism, but it’s about meeting a community need in an innovative way. I started thinking about a Homicide Watch-type site when I was trying to follow a few local cases shortly after moving to DC. It was nearly impossible to find news sources for the information. I did find that family and friends of victims and suspects were posting newsy updates in unusual places -- online obituaries and Facebook memorial pages, for example. I thought a lot about how a news product could fit the expressed need for news, information, and a way for the community to stay in touch about cases.

The data part developed very naturally out of that. The earliest description of the site was “everything a reporter would have in their notebook or on their desk while covering a murder case from start to finish.” That’s still one of the guiding principles of the site, but it’s also meant that organizing that information is super important. What good is making court dates public if you’re not putting them on a calendar, for example?

We started, like I said, with a spreadsheet that listed everything we knew: victim name, age, race, gender, method of death, place of death, link to obituary, photo, suspect name, age, race, gender, case status, incarceration status, detective name, age, race, gender, phone number, judge assigned to case, attorneys connected to the case, co-defendants, connections to other murder cases.

And those are just the basics. Any reporter covering a murder case, crime to conviction, should have that information. What Homicide Watch does is organize it, make as much of it public as we can, and then report from it. It’s led to some pretty cool work, from developing a method to discover news tips in analytics, to simply building news packages that accomplish more than anyone else can.

Chris: Homicide Watch is really the project I wanted to build for years. It’s data-driven beat reporting, where the platform and the editorial direction are tightly coupled. In a lot of ways, it’s what I had in mind when I was writing about frameworks for reporting.

The site is built to be a crime reporter’s toolkit. It’s built around the way Laura works, based on our conversations over the dinner table for the first six months of the site’s existence. Building it meant understanding the legal system, doing reporting and modeling reality in ways I hadn’t done before, and that was a challenge on both the technical and editorial side.

Where do you turn to keep your skills updated or learn new things?

Laura: Assigning myself new projects and tasks is the best way for me to learn; it forces me to find solutions for what I want to do. I’m not great at seeking out resources on my own, but I keep a close eye on Twitter for what others are doing, saying about it, and reading.

Chris: Part of my usual morning news reading is a run through a bunch of programming blogs. I try to get exposed to technologies that have no immediate use to me, just so it keeps me thinking about other ways to approach a problem and to see what other problems people are trying to solve.

I spend a lot of time trying to reverse-engineer other people’s projects, too. Whenever someone launches a new news app, I’ll try to find the data behind it, take a dive through the source code if it’s available and generally see if I can reconstruct how it came together.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Laura: Working on Homicide Watch has taught me that news is about so much more than “stories.” If you think about a typical crime brief, for example, there’s a lot of information in there, starting with the "who-what-where-when." Once that brief is filed and published, though, all of that information disappears.

Working with news apps gives us the ability to harness that information and reuse/repackage it. It’s about slicing our reporting in as many ways as possible in order to make the most of it. On Homicide Watch, that means maintaining a database and creating features like victims’ and suspects’ pages. Those features help regroup, refocus, and curate the reporting into evergreen resources that benefit both reporters and the community.

Chris: Spend some time with your site analytics. You’ll find that there’s no one thing your audience wants. There isn’t even really one audience. Lots of people want lots of different things at different times, or at least different views of the information you have.

One of our design goals with Homicide Watch is “never hit a dead end.” A user may come in looking for information about a certain case, then decide she’s curious about a related issue, then wonder which cases are closed. We want users to be able to explore what we’ve gathered and to be able to answer their own questions. Stories are part of that, but stories are data, too.

February 07 2012

Unstructured data is worth the effort when you've got the right tools

It's dawning on companies that data analysis can yield insights and inform business decisions. As data-driven benefits grow, so do our demands about what more data can tell us and what other types we can mine.

During her PhD studies, Alyona Medelyan (@zelandiya) developed Maui, an open source tool that performs as well as professional librarians in identifying main topics in documents. Medelyan now leads the research and development of API-based products at Pingar.

Pingar senior software researcher Anna Divoli (@annadivoli) studied sentence extraction for semi-automatic annotation of biological databases. Her current research focuses on developing methodologies for acquiring knowledge from textual data.

"Big data is important in many diverse areas, such as science, social media, and enterprise," observes Divoli. "Our big data niche is analysis of unstructured text." In the interview below, Medelyan and Divoli describe their work and what they see on the horizon for unstructured data analysis.

How did you get started in big data?

Anna Divoli: I began working with big data as it relates to science during my PhD. I worked with bioinformaticians who mined proteomics data. My research was on mining information from the biomedical literature that could serve as annotation in a database of protein families.

Alyona Medelyan: Like Anna, I mainly focus on unstructured data and how it can be managed using clever algorithms. During my PhD in natural language processing and data mining, I started applying such algorithms to large datasets to investigate how time-consuming data analysis and processing tasks can be automated.

What projects are you working on now?

Alyona Medelyan: For the past two years at Pingar, I've been developing solutions for enterprise customers who accumulate unstructured data and want to search, analyze, and explore this data efficiently. We develop entity extraction, text summarization, and other text analytics solutions to help scrub and interpret unstructured data in an organization.

Anna Divoli: We're focusing on several verticals that struggle with too much textual data, such as bioscience, legal, and government. We also strive to develop language-independent solutions.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

What are the trends and challenges you're seeing in the big data space?

Anna Divoli: There are plenty of trends that span various aspects of big data, such as making the data accessible from mobile devices, cloud solutions, addressing security and privacy issues, and analyzing social data.

One trend that is pertinent to us is the increasing popularity of APIs. Plenty of APIs exist that give access to large datasets, but there are also powerful APIs that manage big data efficiently, such as text analytics, entity extraction, and data mining APIs.

Alyona Medelyan: The great thing about APIs is that they can be integrated into existing applications used inside an organization.

With regard to the challenges, enterprise data is very messy, inconsistent, and spread out across multiple internal systems and applications. APIs like the ones we're working on can bring consistency and structure to a company's legacy data.

The presentation you'll be giving at the Strata Conference will focus on practical applications of mining unstructured data. Why is this an important topic to address?

Anna Divoli: Every single organization in every vertical deals with unstructured data. Tons of text is produced daily — emails, reports, proposals, patents, literature, etc. This data needs to be mined to allow fast searching, easy processing, and quick decision making.

Alyona Medelyan: Big data often stands for structured data that is collected into a well-defined database — who bought which book in an online bookstore, for example. Such databases are relatively easy to mine because they have a consistent form. At the same time, there is plenty of unstructured data that is just as valuable, but it's extremely difficult to analyze it because it lacks structure. In our presentation, we will show how to detect structure using APIs, natural language processing and text mining, and demonstrate how this creates immediate value for business users.

Are there important new tools or projects on the horizon for big data?

Alyona Medelyan: Text analytics tools are very hot right now, and they improve daily as scientists come up with new ways of making algorithms understand written text more accurately. It is amazing that an algorithm can detect names of people, organizations, and locations within seconds simply by analyzing the context in which words are used. The trend for such tools is to move toward recognition of further useful entities, such as product names, brands, events, and skills.
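To make the idea of context-driven entity detection concrete, here's a deliberately naive sketch (not Pingar's actual API, and far cruder than a real statistical tagger): it treats a run of capitalized words as a candidate entity and uses the surrounding words and suffixes as type cues.

```python
import re

# Toy context cues -- a real system would learn these from training data.
LOCATION_CUES = {"in", "at", "near", "from"}
ORG_SUFFIXES = {"Inc", "Corp", "Ltd", "University"}

def tag_entities(text):
    entities = []
    # One or more consecutive capitalized tokens, e.g. "New Zealand".
    for match in re.finditer(r"(?:[A-Z][a-z]+\s)*[A-Z][a-z]+", text):
        candidate = match.group().strip()
        preceding = text[:match.start()].split()
        cue = preceding[-1].lower() if preceding else ""
        if candidate.split()[-1] in ORG_SUFFIXES:
            entities.append((candidate, "ORG"))      # suffix cue
        elif cue in LOCATION_CUES:
            entities.append((candidate, "LOC"))      # preposition cue
        else:
            entities.append((candidate, "UNKNOWN"))  # no context signal
    return entities

print(tag_entities("Alyona works at Pingar Ltd in Auckland"))
# [('Alyona', 'UNKNOWN'), ('Pingar Ltd', 'ORG'), ('Auckland', 'LOC')]
```

Even this toy version shows the principle Medelyan describes: the same capitalized word gets a different label depending on the context in which it is used.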

Anna Divoli: Also, entity relation extraction is an important trend. A relation that consistently connects two entities in many documents is important information in science and enterprise alike. Entity relation extraction helps detect new knowledge in big data.

Other trends include detecting sentiment in social data, integrating multiple languages, and applying text analytics to audio and video transcripts. The number of videos grows at a constant rate, and transcripts are even more unstructured than written text because there is no punctuation. That's another exciting area on the horizon!

Who do you follow in the big data community?

Alyona Medelyan: We tend to follow researchers in areas that are used for dealing with big data, such as natural language processing, visualization, user experience, human-computer information retrieval, and the semantic web. Two of them are also speaking at Strata this year: Daniel Tunkelang and Marti Hearst.

This interview was edited and condensed.



January 11 2012

Can Maryland's other "CIO" cultivate innovation in government?

Bryan Sivak at OSCON 2010

When Maryland hired Bryan Sivak last April as the state's chief innovation officer, the role had yet to be defined in government. After all, like most other states in the union, Maryland had never had a chief innovation officer before.

Sivak told TechPresident on his second day at work that he wanted to define what it means to build a system for innovation in government:

If you can systemize what it means to be innovative, what it means to challenge the status quo without a budget, without a lot of resources, then you've created something that can be replicated anywhere.

Months later, Sivak (@BryanSivak) has been learning — and sharing — as he goes. That doesn't mean he walked into the role without ideas about how government could be more innovative. Anything but. Sivak's years in the software industry and his tenure as the District of Columbia's chief technology officer equipped him with plenty of ideas, along with some recognition as a Gov 2.0 Hero from Govfresh.

Sivak was a genuine change agent during his tenure in DC. As DCist reported, Sivak oversaw the development of several projects while he was in office, like the District's online service request center and "the incredibly useful TrackDC."

Citizensourcing better ideas

One of the best ideas that Sivak brought to his new gig was culled directly from the open government movement: using collective intelligence to solve problems.

"My job is to fight against the entrenched status quo," said Sivak in an interview this winter. "I'm not a subject expert in 99% of issues. The people who do those jobs, live and breathe them, do know what's happening. There are thousands and thousands of people asking 'why can't we do this this way? My job is to find them, help them, get them discovered, and connect them."

That includes both internal and external efforts, like a pilot partnership with citizens to report downed trees last year.

An experiment with SeeClickFix during Hurricane Irene in August 2011 had a number of positive effects, explained Sivak. "It made emergency management people realize that they needed to look at this stuff," he said. "Our intention was to get people thinking. The new question is now, 'How do we figure out how to use it?' They're thinking about how to integrate it into their process."

Gathering ideas for making government work better from the public presents some challenges. For instance, widespread public frustration with the public sector can also make citizensourcing efforts a headache to architect and govern. Sivak suggested trying to get upset citizens involved in addressing the problems they highlight in public comments.

"Raise the issue, then channel the negative reactions into fixing the issues," he said. "Why not get involved? There are millions of civil servants trying to do the right thing every day."

In general, Sivak said, "the vast majority of people are there to do a good job. The problem is rules and regulations built up over centuries that prevent us from doing that the best way."


Doing more with less

If innovation is driven by resource constraints, by "doing more with less," Sivak will be in the right place at the right time. Maryland Governor Martin O'Malley's 2012 budget included billions in proposed cuts, including hundreds of millions pared from state agencies. More difficult decisions will be in the 2013 budget as well.

The challenge now is how to bring ideas to fruition in the context of state government, where entrenched bureaucracy, culture, aging legacy IT systems and more prosaic challenges around regulations stand in the way.

One clear direction is to find cost savings in modern IT, where possible. Moving government IT systems into the cloud won't be appropriate in all circumstances. Enterprise email, however, appears to be a ripe area for migration. Maryland is moving to the cloud for email, specifically to Google Apps for Enterprise.

This will merge 57 systems into one, said Sivak. "Everyone is really jazzed." The General Services Administration saved an estimated $11 million for 13,000 employees, according to Sivak. "We hope to save more. People don't factor in upgrade costs."

He's found, however, that legacy IT systems aren't the most significant hindrance to innovation in government. "When I started public service [in D.C. government], procurement and HR were the things I was least interested in," said Sivak. "Now, they are the things I'm most interested in. Fix them and you fix many of the problems."

The problem with reform of these areas, however, is that it's neither a particularly sexy issue for politicians to run on in election years nor to focus upon in office. Sivak told TechPresident last year that "to be successful with technology initiatives, you need to attack the underlying culture and process first."

Given the necessary focus on the economy and job creation, the Maryland governor's office is also thinking about how to attract and sustain entrepreneurs and small businesses. "We're also working on making research and development benefit the state more," Sivak said. "Maryland is near the top of R&D spending but very low on commercializing the outcomes."

In an extended interview conducted via email (portions posted below), Sivak expanded further about his view of innovation, what's on his task list and some of the projects that he's been working on to date.

How do you define innovation? What pockets of innovation in government inspire you?

Bryan Sivak: Innovation is an overused term in nearly every vertical — both the public and private sectors — which is why a definition is important. My current working definition is something like this:

Innovation challenges existing processes and systems, resulting in the injection, rapid execution and validation of new ideas into the ecosystem. In short, innovation asks "why?" a lot.

Note that this is my current working definition and might change without notice.

What measures has Maryland taken to attract and retain startups?

Bryan Sivak: There are a number of entities across the state that are focused on Maryland's startup ecosystem. Many are on the local level and the private academic side (incubators, accelerators, etc.), but as a state we have organizations that are — at least partially — focused on this as well. TEDCO, for example, is a quasi-public entity focused on encouraging technology development across the state. And the Department of Business and Economic Development has a number of people who are focused on building the state's startup infrastructure.

One of the things I've been focusing on is the "commercialization gap," specifically the fact that Maryland ranks No. 1 per capita in PhD scientists and engineers, No. 1 in federal research and development dollars per capita, and No. 1 in the best public schools in the country, but it is ranked No. 37 in terms of entrepreneurial activity. We are working on coming up with a package to address this gap and to help commercialize technologies that are a result of R&D investment into our academic and research institutions.

What about the cybersecurity industry?

Bryan Sivak: Cybersecurity is a big deal in Maryland, and in 2010, the Department of Business and Economic Development released its Cyber Maryland plan, which contains 10 priorities that the state is working on to make Maryland the cybersecurity hub of the U.S. Given the preponderance of talent and specific institutions in the state, it makes a ton of sense and builds on assets we already have in place.

What have you learned from Maryland's crowdsourcing efforts to date?

Bryan Sivak: We've really just started to dip our toes in the crowdsourcing waters, but it's been very interesting so far. The desire is there — people definitely want to contribute — but what's become very clear is that we need a process in place on the back end to handle incoming items. On the public safety front, for example, most of the issues that get reported by citizens will be dealt with by the locals, as opposed to the state. We need a mechanism for issues to be reported and tracked in a single interface but acted upon by the appropriate entity.

This is much easier on the local side since all groups are theoretically on the same page. We are also building ad-hoc processes on the fly to handle responses to other crowdsourced inputs. For example, we recently asked citizens and businesses for ideas for regulatory reform in the state. In order to make sure these inputs were handled correctly, we created a manual, human-based process to filter the ideas and make sure the right people at the right agencies saw them. This worked well for this initiative, but it is obviously not scalable for implementation on a broad scale.

The conclusion is that the desire and ability of people outside the government to contribute are not going to decrease, so if we are proactive on this issue and try to stay ahead of or with the curve, everyone — government, residents, and businesses — will benefit.

What roles do data and analytics play in Maryland's governance processes and policy making?

Bryan Sivak: They play huge roles. The governor [Martin O'Malley] is well known for his belief in data-driven decision making, which was the impetus behind the creation of CitiStat in Baltimore and StateStat in Maryland. We use dashboards to track nearly every initiative, and this data features prominently in almost every policy discussion. As an example, check out the Governor's Delivery Unit website, where we publish a good amount of analysis we use to track achievement of goals. We are now working on building a robust data warehouse that will not only enable us to provide a deeper level of analysis on the fly, and on both a preset and an ad-hoc basis, but also give us the added benefit of easily publishing raw data to the community at large.

What have you learned through StateStat? How can you realize more value from it through automation?

Bryan Sivak: The StateStat program is incredibly effective in terms of focusing agencies on a set of desired outcomes and rigorously tracking their progress. One of the big challenges, however, is data collection and analysis. Currently, most of the data is collected by hand and entered into Excel spreadsheets for analysis and distribution. This was a great mechanism to get the program up and running, but by building the data warehouse, we will be able to automate a great deal of the data collection and processing that is currently being done manually. We also hope that by connecting data sources directly to the warehouse, we'll be able to get a much more real-time view of the leading indicators and have dashboards that reflect the current moment, as opposed to historical data.

This interview was edited and condensed. Photo by James Duncan Davidson.


November 17 2011

Strata Week: Why ThinkUp matters

Here are a few of the data stories that caught my attention this week.

ThinkUp hits 1.0

ThinkUp, a tool out of Expert Labs, enables users to archive, search and export their Twitter, Facebook and Google+ history — both posts and post replies. It also allows users to see their network activity, including new followers, and to map that information. Originally created by Gina Trapani, ThinkUp is free and open source, and will run on a user's own web server.

That's crucial, says Expert Labs' founder Anil Dash, who describes ThinkUp's launch as "software that matters." He writes that "ThinkUp's launch matters to me because of what it represents: The web we were promised we would have. The web that I fell in love with, and that has given me so much. A web that we can hack, and tweak, and own." Imagine everything you've ever written on Twitter, every status update on Facebook, every message on Google+ and every response you've had to those posts — imagine them wiped out by the companies that control those social networks.

Why would I ascribe such awful behavior to the nice people who run these social networks? Because history shows us that it happens. Over and over and over. The clips uploaded to Google Videos, the sites published to Geocities, the entire relationships that began and ended on Friendster: They're all gone. Some kind-hearted folks are trying to archive those things for the record, and that's wonderful. But what about the record for your life, a private version that's not for sharing with the world, but that preserves the information or ideas or moments that you care about?

It's in light of this, no doubt, that ReadWriteWeb's Jon Mitchell calls ThinkUp "the social media management tool that matters most." Indeed, as we pour more of our lives into these social sites, tools like ThinkUp, along with endeavors like the Locker Project, mark important efforts to help people own, control and utilize their own data.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

DataSift opens up its Twitter firehose

DataSift, one of only two companies licensed by Twitter to syndicate its firehose (the other being Gnip), officially opened to the public this week. That means that those using DataSift can in turn mine all the social data that comes from Twitter — data that comes at a rate of some 250 million tweets per day. DataSift's customers can analyze this data for more than just keyword searches and can apply various filters, including demographic information, sentiment, gender, and even Klout score. The company also offers data from MySpace and plans to add Google+ and Facebook data soon.
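
As a rough, hypothetical sketch of the kind of stream filtering described above — the field names, scores, and sample items below are invented for illustration and do not reflect DataSift's actual filtering language or data schema:

```python
# Illustrative only: filter a stream of social posts by keyword plus a
# minimum influence score, the way a DataSift-style filter might combine
# a text match with a Klout-like threshold.
def matching(stream, keyword, min_score):
    """Yield only the items that contain `keyword` and clear `min_score`."""
    for item in stream:
        if keyword in item["text"].lower() and item["score"] >= min_score:
            yield item

sample = [
    {"text": "Loving the new phone", "score": 62},
    {"text": "new phone is terrible", "score": 12},
    {"text": "lunch was great", "score": 80},
]

hits = list(matching(sample, "phone", 50))
print(len(hits))  # -> 1 (only the first item matches both conditions)
```

Because the filter is a generator, it can be applied lazily to a firehose-scale stream rather than a list held in memory.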

DataSift, which was founded by Tweetmeme's Nick Halstead and raised $6 million earlier this year, is offered on a pay-as-you-go subscription model.

Google's BigQuery service opens to more developers

Google announced this week that it was letting more companies have access to its pilot of BigQuery, its big data analytics service. The tool was initially developed for internal use at Google, and it was opened to a limited number of developers and companies at Google I/O earlier this year. Now, Google is allowing a few more companies into the fold (you can indicate your interest here), offering them the service for free — with the promise of 30 days' notice if it decides to charge — as well as adding some user interface improvements.

In addition to a GUI for the web-based version, Google has improved the REST API for BigQuery as well. The new API offers granular control over permissions and lets you run multiple jobs in the background.

BigQuery is based on the Google tool formerly known as Dremel, which the company discussed in a research paper published last year:

[Dremel] is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data.
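
The columnar layout the paper credits for that speed can be illustrated with a toy sketch (this is not Dremel or BigQuery code; the data and field names are invented). An aggregation query over one field only has to touch that field's contiguous column, rather than visiting every whole record:

```python
# Toy comparison of row-oriented vs. column-oriented storage for a
# single-column aggregation, the query shape Dremel is optimized for.

# Row-oriented: each record stores every field together.
rows = [{"user": i, "bytes": i % 7, "country": "US"} for i in range(100_000)]

# Column-oriented: each field lives in its own contiguous array.
columns = {
    "user": [r["user"] for r in rows],
    "bytes": [r["bytes"] for r in rows],
    "country": [r["country"] for r in rows],
}

def sum_bytes_rows(rows):
    # Must visit every full record to extract one field.
    return sum(r["bytes"] for r in rows)

def sum_bytes_columns(columns):
    # Reads only the one column the query actually needs.
    return sum(columns["bytes"])

assert sum_bytes_rows(rows) == sum_bytes_columns(columns)
```

At trillion-row scale the difference is I/O, not CPU: the columnar scan reads a small fraction of the bytes, which is what makes interactive aggregation feasible.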

In the blog post announcing the changes to BigQuery, Google cites Michael J. Franklin, Professor of Computer Science at UC Berkeley, who calls BigQuery's ability to process big data "jaw-dropping."

Got data news?

Feel free to email me.


November 09 2011

Social network analysis isn't just for social networks

Social networking has become a pervasive part of our everyday online experience, and by extension, that means the analysis and application of social data is an essential component of business.

In the following interview, "Social Network Analysis for Startups" co-author Maksim Tsvetovat (@maksim2042) offers a primer on social network analysis (SNA) and how it has relevance beyond social-networking services.

What is social network analysis (SNA)?

Maksim Tsvetovat: Social network analysis is an offshoot of the social sciences — sociology, political science, psychology, anthropology and others — that studies human interactions by using graph-theoretic approaches rather than traditional statistics. It's a scientific methodology for data analysis and also a collection of theories about how and why people interact — and how these interaction patterns change and affect our lives as individuals or societies. The theories come from a variety of social sciences, but they are always backed up with mathematical ways of measuring if a specific theory is applicable to a specific set of data.

In the science world, the field is considered interdisciplinary, so gatherings draw mathematicians, physicists, computer scientists, sociologists, political scientists and even an occasional rock musician.

As far as the technology aspect goes, the analysis methods are embodied in a set of software tools, such as the Python-based NetworkX library, which the book uses extensively. These tools can be used for analyzing and visualizing network data in a variety of contexts, from visualizing the spread of disease to business intelligence applications.
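
To make the graph-theoretic approach concrete, here is a tiny, dependency-free sketch of one core SNA measure, degree centrality (NetworkX, the library named above, provides the same calculation as `nx.degree_centrality`). The people and relationships below are invented for illustration:

```python
# Degree centrality from a plain edge list: the fraction of the other
# actors in the network each actor is directly connected to.
from collections import defaultdict

edges = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("carol", "dave"), ("dave", "erin"),
]

def degree_centrality(edges):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)  # total number of actors
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

centrality = degree_centrality(edges)
# carol is tied to 3 of the 4 other people, so she scores highest (0.75).
print(max(centrality, key=centrality.get))  # -> carol
```

The same edge-list input works whether the "actors" are people, companies, campaign donors, or treaty signatories — which is the point of applying SNA beyond social-networking services.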

In terms of marketing applications, there's plenty of science behind "why things go viral" — and the book goes briefly into it — but I find that it's best to leave marketing to marketing professionals.

Does SNA refer specifically to the major social-networking services, or does it also apply beyond them?

Maksim Tsvetovat: SNA refers to the study of relationships between people, companies, organizations, websites, etc. If we have a set of relationships that may be forming a meaningful pattern, we can use SNA methods to make sense of it.

Major social-networking services are a great source of data for SNA, and they present some very interesting questions — most recently, how can a social network act as an early warning system for natural disasters? I'm also intrigued by the emergent role of Twitter as a "common carrier" and aggregation technology for data from other media. However, the analysis methodology is applicable to many other data sources. In fact, I purposefully avoided using Twitter as a data source in the book — it's the obvious place to start and also a good place to get tunnel vision about the technology.

Instead, I concentrated on getting and analyzing data from other sources, including campaign finance, startup company funding rounds, international treaties, etc., to demonstrate the potential breadth of applications of this technology.

Social Network Analysis for Startups — Social network analysis (SNA) is a discipline that predates Facebook and Twitter by 30 years. Through expert SNA researchers, you'll learn concepts and techniques for recognizing patterns in social media, political groups, companies, cultural trends, and interpersonal networks.

Today only: Get "Social Network Analysis for Startups" for $9.99 (save 50%).

How does SNA relate to startups?

Maksim Tsvetovat: A lot of startups these days talk about social-this and social-that — and all of their activity can be measured and understood using SNA metrics. Being able to integrate SNA into their internal business intelligence toolkits should make businesses more attuned to their audiences.

I have personally worked with three startups that used SNA to fine-tune their social media targeting strategies by locating individuals and communities, and addressing them directly. Also, my methodologies have been used by a few large firms: the digital marketing agency DIGITAS is using SNA daily for a variety of high-profile clients. (Disclosure: my startup firm, DeepMile Networks, is involved in supplying SNA tools and services to DIGITAS and a number of others.)

What SNA shifts should developers watch for in the near future?

Maksim Tsvetovat: Multi-mode network analysis, which is analyzing networks with many types of "actors" (people, organizations, resources, governments, etc.). I approach the topic briefly in the book — but much remains to be done.

Also, watch for more real-time analysis. Most SNA is done on snapshot-style data that is, at best, a few hours out-of-date — some is years out-of-date. The release of Twitter's Storm tool should spur developers to make more SNA tools work on real-time and flowing data.

This interview was edited and condensed.

Associated photo on home and category pages: bulletin board [before there was twitter] by woodleywonderworks, on Flickr.



September 15 2011

Global Adaptation Index enables better data-driven decisions

The launch of the Global Adaptation Index (GaIn) literally puts a powerful open data browser into the hands of anyone with a connected mobile device. The index rates a given country's vulnerability to environmental shifts precipitated by climate change, its readiness to adapt to such changes, and its ability to utilize investment capital that would address the state of those vulnerabilities.

Global Adaptation Index

The Global Adaptation Index combines development indicators from 161 countries into a map that provides quick access to thousands of open data records. All of the data visualizations are powered by indicators that are openly available and downloadable under a Creative Commons license.

"All of the technology that we're using is a way to bring this information close to society," said Bruno Sanchez-Andrade Nuño, the director of science and technology at the Global Adaptation Institute (GAI), the organization that launched the index.

Open data, open methodology

The project was helped by the World Bank's move to open data, including the release of its full development database. "All data is from sources that are already open," said Ian Noble, chief scientist at GAI. "We would not use any data that had restrictions. We can point people through to the data source and encourage them to download the data."

Being open in this manner is "the most effective way of testing and improving the index," said Noble. "We have to be certain that data is from a quality, authoritative source and be able to give you an immediate source for it, like the FAO, WHO or disaster database."

"It's not only the data that's open, but also our methodology," said Nuño. "[The World Bank's data portal] is a really good base, with something like 70% of our data going through that portal. With some of the rest of the data, we see lots of gaps. We're trying to make all values consistent."

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code ORM30

Node.js powers the data browser

"This initiative is a big deal in the open data space as it shows a maturing from doing open data hacking competitions to powering a portal that will help channel billions of investment dollars over the next several years," said Development Seed associate Bonnie Bugle in a prepared statement. Development Seed built the site with open source tools, including Node.js and CouchDB.

The choice of Node is a useful indicator, in terms of where the cutting edge of open source technology is moving. "The most important breakthrough is moving beyond PHP and Drupal — our initial thought — to Node.js," said Nuño. "Drupal and PHP are robust and well known, but this seems like the next big thing. We really wanted to push the limits of what's possible. Node.js is faster and allows for more connections. If you navigate countries using the data browser, you're just two clicks away from the source data. It doesn't feel like a web page. It feels native."

Speed of access and interoperability were important considerations, said Nuño. "It works on an iOS device or on a slow connection, like GPRS." Noble said he had even accessed it from rural Australia using an iPad.

Highlights from the GAI press conference are available in the following video:

Global Adaptation Index Press Conference: Data Browser Launched from Development Seed on Vimeo.


September 07 2011

Look at Cook sets a high bar for open government data visualizations

Every month, more open government data is available online. Open government data is being used in mobile apps, baked into search engines or incorporated into powerful data visualizations. An important part of that trend is that local governments are becoming data suppliers.

For local, state and federal governments, however, releasing data is not enough. Someone has to put it to work, pulling the data together to create cohesive stories so citizens and other stakeholders can gain more knowledge. Sometimes this work is performed by public servants, though data visualization and user experience design have historically not been the strong suits of government employees. In the hands of skilled developers and designers, however, open data can be used to tell powerful stories.

One of the best recent efforts at visualizing local open government data can be found at Look at Cook, which tracks government budgets and expenditures from 1993-2011 in Cook County, Illinois.


The site was designed and developed by Derek Eder and Nick Rougeux, in collaboration with Cook County Commissioner John Fritchey. Below, Eder explains how they built the site, the civic stack tools they applied, and the problems Look at Cook aims to solve.

Why did you build Look at Cook?

Derek Eder: After being installed as a Cook County Commissioner, John Fritchey, along with the rest of the Board of Commissioners, had to tackle a very difficult budget season. He realized that even though the budget books were presented in the best accounting format possible and were also posted online in PDF format, this information was still not friendly to the public. After some internal discussion, one of his staff members, Seth Lavin, approached me and Nick Rougeux and asked that we develop a visualization that would let the public easily explore and understand the budget in greater detail. Seth and I had previously connected through some of Chicago's open government social functions, and we were looking for an opportunity for the county and the open government community to collaborate.


What problems does Look at Cook solve for government?

Derek Eder: Look at Cook shines a light on what's working in the system and what's not. Cook County, along with many other municipalities, has its fair share of problems, but before you can even try to fix any of them, you need to understand what they are. This visualization does exactly that. You can look at the Jail Diversion department in the Public Safety Fund and compare it to the Corrections and Juvenile Detention departments. They have an inverse relationship, and you can actually see one affecting the other between 2005 and 2007. There are probably dozens of other stories like these hidden within the budget data. All that was needed was an easy way to find and correlate them — which anyone can now do with our tool.

Look at Cook visualization example
Is there a relationship between the lower funding for Cook County's Jail Diversion and Crime Prevention division and the higher funding levels for the Department of Corrections and the Juvenile Temporary Detention Center divisions? (Click to enlarge.)

What problems does Look at Cook solve for citizens?

Derek Eder: Working on and now using Look at Cook opened my eyes to what Cook County government does. In Chicago especially, there is a big disconnect between where the county begins and where the city ends. Now I can see that the county runs specific hospitals and jails, maintains highways, and manages dozens of other civic institutions. Additionally, I know how much money it is spending on each, and I can begin to understand just how $3.5 billion is spent every year. If I'm interested, I can take it a step further and start asking questions about why the county spends money on what it does and how it has been distributed over the last 18 years. Examples include:

  • Why did the Clerk of the Circuit Court get a 480% increase in its budget between 2007 and 2008? See the 2008 public safety fund.
  • How is the Cook County Board President going to deal with a 74% decrease in appropriations for 2011? See the 2011 president data.
  • What happened in 2008 when the Secretary of the Board of Commissioners got its funding reallocated to the individual District Commissioners? See the 2008 corporate fund.

As a citizen, I now have a powerful tool for asking these questions and being more involved in my local government.

What data did you use?

Derek Eder: We were given budget data in a fairly raw format as a basic spreadsheet broken down into appropriations and expenditures by department and year. That data went back to 1993. Collectively, we and Commissioner Fritchey's office agreed that clear descriptions of everything were crucial to the success of the site, so his office diligently spent the time to write and collect them. They also made connections between all the data points so we could see what control officer was in charge of what department, and they hunted down the official websites for each department.

What tools did you use to build Look at Cook?

Derek Eder: Our research began with basic charts in Excel to get an initial idea of what the data looked like. Considering the nature of the data, we knew we wanted to show trends over time and let people compare departments, funds, and control officers. This made line and bar charts a natural choice. From there, we created a couple iterations of wireframes and storyboards to get an idea of the visual layout and style. Given our prior technical experience building websites at Webitects, we decided to use free tools like jQuery for front-end functionality and Google Fusion Tables to house the data. We're also big fans of Google Analytics, so we're using it to track how people are using the site.

What design principles did you apply?

Derek Eder: Our guiding principles were clarity and transparency. We were already familiar with other popular visualizations, like the New York Times' federal budget and the Death and Taxes poster from WallStats. While they were intriguing, they seemed to lack some of these traits. We wanted to illustrate the budget in a way that anyone could explore without being an expert in county government. From a visual standpoint, the goal was to present the information professionally and essentially let the visuals get out of the way so the data could be the focus.

We feel that designing with data means that the data should do most of the talking. Effective design encourages people to explore information without making them feel overwhelmed. A good example of this is how we progressively expose more information as people drill down into departments and control officers. Effective design should also create some level of emotional connection with people so they understand what they're seeing. For example, someone may know one of the control officers or have had an experience with one of the departments. This small connection draws their attention to those areas and gets them to ask questions about why things are the way they are.

This interview was edited and condensed.


August 10 2011

FCC contest stimulates development of apps to help keep ISPs honest

Last Friday, the Federal Communications Commission (FCC) announced the winners of its Open Internet Challenge.

"The winners of this contest will help ensure continued certainty, innovation and investment in the broadband sector," said FCC chairman Julius Genachowski at the awards ceremony. "Shining a light on network management practices will ensure that incentives for entrepreneurs and innovators remain strong. They will help deter improper conduct, helping ensure that consumers and the marketplace pick winners and losers online, and that websites or applications aren't improperly blocked or slowed."

The contest received twenty-four submissions in total, with three winners. MobiPerf, a mobile network measurement tool that runs on Android, iOS, and Windows Mobile devices, won both the People's Choice Award and the award for best overall Open Internet App. MobiPerf collects anonymous network measurement information directly from mobile phones. It was designed by a University of Michigan and Microsoft Research team.


Two apps and teams shared the Open Internet Research Award. ShaperProbe, originally called "DiffProbe," is designed to detect service discrimination by Internet service providers (ISPs). ShaperProbe uses the Measurement Lab (M-Lab) research platform. All of the data collected through ShaperProbe will be publicly accessible, according to the Georgia Institute of Technology, which developed the app.

Netalyzer is a Web-based Java app that measures and debugs a network. Notably, the Netalyzer Internet traffic analysis tool has a "Mom Mode," which may make it more accessible to people like, well, my own mother. Netalyzer was built by the International Computer Science Institute (ICSI) at the University of California at Berkeley.

More details about the winners and the teams that built them are available online.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science -- from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

Open Internet questions

It almost goes without saying that this contest carried some baggage at the outset. Last December, the launch of a new contest by the Federal Communications Commission was overshadowed by concerns about what the new FCC open Internet rules could mean for net neutrality, particularly with respect to the mobile space that is of critical interest to many developers. Nonetheless, the FCC open Internet challenge went forward, focused on stimulating the development of apps for network quality of service testing.

Amid legitimate concerns about the sustainability of apps contests, the outcomes of this Open Internet challenge offer a couple of important data points.

First, the challenge does seem to have stimulated the creation of a new resource for the online community: unlike the other two winners, the MobiPerf app was created for the contest, according to FCC press secretary Neil Grace.

Second, when this challenge launched, collecting more data for better net neutrality was a goal that organizations like the Electronic Frontier Foundation and M-Lab supported. The best answers to questions about filtering or shaping rely "on the public having real knowledge about how our Internet connections are functioning and whether or not ISPs are providing the open Internet that users want," wrote Richard Esguerra.

Now the public has better tools to gather and share that knowledge. Will these apps "shed light" on broadband providers' tactics? As with so many apps, that will depend on whether people *use* them or not. The two winning apps that existed before the contest, Netalyzer and ShaperProbe, have already been used thousands of times, so there's reason to expect more usage. For instance, Netalyzer can be (and was) applied in analyzing widespread search hijacking in the United States. In that context, empowered consumers that can detect and share data about the behavior of their Internet service providers could play a more important role in the broadband services market.

Finally, the FCC has established new ties to the research and development communities at Berkeley, Georgia Tech and other institutions. Integrating more technical expertise from academia with the regulator's institutional knowledge is an important outcome from the challenge, and not one that is as easily measured as "a new app for that." It's not clear yet whether the outcomes from the Apps for Communities challenge, set to conclude on August 31st, will be as positive.

The expertise and the data collected from these apps might also come in handy if the time ever comes when the regulator has to make a controversial decision about whether a given ISP's service to its users goes beyond "reasonable network management."


July 26 2011

Real-time data needs to power the business side, not just tech

In 2005, real-time data analysis was being pioneered and predicted to "transform society." A few short years later, the technology is a reality and indeed is changing the way people do business. But Theo Schlossnagle (@postwait), principal and CEO of OmniTI, says we're not quite there yet.

In a recent interview, Schlossnagle said that not only does the current technology allow less-qualified people to analyze data, but that most of the analysis being done is strictly for technical benefit. The real benefit will be realized when the technology is capable of powering real-time business decisions.

Our interview follows.

How has data analysis evolved over the last few years?

Theo Schlossnagle: The general field of data analysis has actually devolved over the last few years because the barrier to entry is dramatically lower. You now have a lot of people attempting to analyze data with no sound mathematics background. I personally see a lot of "analysis" happening that is less mature than your run-of-the-mill graduate-level statistics course or even undergraduate-level signal analysis course.

But where does it need to evolve? Storage is cheaper and more readily available than ever before. This leads organizations to store data like it's going out of style. This isn't a bad thing, but it causes a significantly lower signal-to-noise ratio. Data analysis techniques going forward will need to evolve much better noise reduction capabilities.
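
One of the simplest noise-reduction techniques of the kind alluded to here is a moving average that smooths a jittery metric series. This is an illustrative sketch with made-up numbers, not anything from OmniTI's toolchain:

```python
# Smooth a noisy series by averaging each point with its recent history.
# A larger window suppresses more noise but lags real changes more.
def moving_average(values, window=3):
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]  # last `window` points
        out.append(sum(chunk) / len(chunk))
    return out

noisy = [10, 50, 12, 48, 11, 49]
print(moving_average(noisy))  # oscillation is damped toward the series mean
```

The trade-off is exactly the signal-to-noise problem described above: aggressive smoothing hides measurement jitter, but it can also hide the short spike you actually wanted to see.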

What does real-time data allow that wasn't available before?

Theo Schlossnagle: Real-time data has been around for a long time, so in a lot of ways, it isn't offering anything new. But the tools to process data in real time have evolved quite a bit. Complex event processing (CEP) systems now provide a much more accessible approach to dealing with data in real time and building millisecond-granularity real-time systems. In a web application, imagine being able to observe something about a user and make an intelligent decision based on that data combined with a larger aggregate data stream — all before you've delivered the headers back to the user.

What's required to harness real-time analysis?

Theo Schlossnagle: Low-latency messaging infrastructure and a good CEP system. In my work we use either RabbitMQ or ZeroMQ and a whole lot of Esper.


Does there need to be a single person at a company who collects data, analyzes it, and makes recommendations, or is that something that can be done algorithmically?

Theo Schlossnagle: You need to have analysts, and I think it is critically important to have them report into the business side — marketing, product, CFO, COO — instead of into the engineering side. We should be doing data analysis to make better business decisions. It is vital to make sure we are always supplied with intelligent and rewarding business questions.

A lot of data analysis done today is technical analysis for technical benefit. The real value is when we can take this technology and expertise and start powering better real-time business decisions. Some of the areas doing real-time analysis well in this regard include finance, stock trading, and high-frequency trading.


March 22 2011

Dashboards evolve to meet social and business needs

Dashboard design is adapting. What once was simply an interface between the user and a dataset has now evolved into a full-fledged platform, and in some cases dashboards have become complete enterprise products.

Socialtext, for example, has developed a dashboard that some companies are opting to use in place of a stagnant corporate intranet.

This kind of enterprise integration also is being combined with a more user-centric design, as seen in the latest update from Google Analytics. By building dashboard tools with users in mind from the start, Google has developed a more flexible dashboard solution. In a recent post for Search Engine Land, Daniel Waisberg noted:

This is probably one of the biggest hits of the release: the capability to create multiple dashboards, each containing any set of graphs. This is a much wanted feature, especially for large organizations, where employees have very different needs from the tool. Now dashboards can be set by hierarchy, department, interest or any other rule.


Ken Hilburn, VP of community enablement at Juice Analytics, broke down the basics of great dashboard design in an interview at Strata. He said the key is to combine three basic tiers of function and then socialize them:

Dashboards are in an evolutionary phase right now. I think that the top level is sort of high-level key indicators that bring focus and draw attention, letting the user know that they need to pay attention. Beneath that, there's dimensions and measures that give context to those metrics — do I want to know more about sales, more about global warming, or temperature increases — including the context around that and what those key metrics mean. The third layer would be detailed data for further exploration and communication.

Then, you can wrap those components into a dashboard. You want to have things like global filtering. You want to have the ability to socialize that — you want to be able to take a snapshot or annotate it, or be able to ask people what they think and communicate back and forth.
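Hilburn's three tiers plus a global filter map naturally onto a layered data model. A minimal sketch (the field names and sample rows are invented for illustration and don't come from any product mentioned above):

```python
# Three tiers of a dashboard, per Hilburn: a key indicator that draws
# attention, dimensional breakdowns for context, and detail rows for
# exploration — with one global filter narrowing every tier at once.

detail_rows = [
    {"region": "EMEA", "product": "A", "sales": 120},
    {"region": "EMEA", "product": "B", "sales": 80},
    {"region": "APAC", "product": "A", "sales": 200},
]

def apply_global_filter(rows, **criteria):
    """Global filter: one selection applies to all three tiers."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]

def key_indicator(rows):
    """Tier 1: a single high-level number that signals where to look."""
    return sum(r["sales"] for r in rows)

def by_dimension(rows, dim):
    """Tier 2: break the indicator down by a dimension for context."""
    out = {}
    for r in rows:
        out[r[dim]] = out.get(r[dim], 0) + r["sales"]
    return out

filtered = apply_global_filter(detail_rows, region="EMEA")
print(key_indicator(filtered))            # tier 1: headline number
print(by_dimension(filtered, "product"))  # tier 2: context
# Tier 3 is `filtered` itself: the detail rows for further exploration.
```

The socialization layer Hilburn describes — snapshots, annotations, back-and-forth discussion — would sit on top of this model, attaching commentary to a filtered view rather than to the raw data.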

You can watch the entire interview with Hilburn in the following video:


March 15 2011

Data integration services combine storage and analysis tools

There has been a lot of movement on the data front in recent months, with a strong focus on integrations between data warehousing and analytics tools. In that spirit, yesterday IBM announced its Netezza data warehouse partnership with Revolution R Enterprise, bringing the R statistics language and predictive analytics to the big data warehouse table.

Microsoft and HP have jumped in as well. Microsoft launched the beta of Dryad, DSC, and DryadLINQ in December, and HP bought Vertica in February. HP plans to integrate Vertica into its new-and-improved overall business model, nicely outlined here by Klint Finley for ReadWrite Enterprise.

These sorts of data integration environments will likely become common as data storage and analysis emerge as mainstream business requirements. Derrick Harris touched on this in a post about the IBM-R partnership and the growing focus on integration:

It's not just about storing lots of data, but also about getting the best insights from it and doing so efficiently, and having large silos for each type of data and each group of business stakeholders doesn't really advance either goal.


March 11 2011

A new focus on user-friendly data analysis

The ability to easily extract meaning from unwieldy datasets has become something of a Holy Grail in data analytics. Technologies like Hadoop make it possible to parse big datasets, but the process hasn't reached the point where an average business user can run reports and conduct analysis.

Roger Ehrenberg, managing partner at IA Ventures, touched on the importance of user interface in a recent interview. He noted that in some ways, we might be developing in the wrong direction:

There's a new-found appreciation for an even greater focus on UI and UX. The experience that a consumer has with a product or application — it's almost as if you need to start there and work backwards as opposed to [saying], "Hey, I've got a cool technology or application. Let's see if this thing works," and then hacking together a UI. Oftentimes, the UI is a secondary consideration and the core technology is the primary. But in many ways you almost want to go the reverse.

The gap between technology and user experience is not lost on developers — or investors. BackType, a social analytics company that developed ElephantDB to export data from Hadoop, just brought in $1 million in investment funding. The company's platform serves as an interface for users to measure social media impact.

The day before BackType announced new funding, HootSuite launched a social analytics dashboard that lets users track social brand performance across platforms like Twitter, Facebook and Google. Adobe has also joined the fray with Adobe SocialAnalytics, a service scheduled for later this year that expands on Adobe's SiteCatalyst product and other Adobe Online Marketing Suite tools.

One additional signal to watch: Social data, which is the current focus of most of these companies and dashboards, may ultimately serve as an entry point for different and deeper types of data analysis. Once users get accustomed to asking big questions against big data, they'll likely expand their queries beyond the social realm.

