
March 22 2012

Direct sales uncover hidden trends for publishers

One of the most important reasons publishers should invest in a direct channel is the data it provides. Retailers will only share a certain amount of customer information with you, but when you make the sale yourself, you have full access to the resulting data stream.

As you may already know, when you buy an ebook directly from O'Reilly, you end up with access to multiple formats of that product. Unlike Amazon, where you only get a Mobi file, or Apple, where you only get an EPUB file, we provide both (as well as PDF and often a couple of others). This gives the customer the freedom of format choice, but it also gives us insight into what our customers prefer. We often look at download trends to see whether PDF is still the most popular format (it is) and whether Mobi and EPUB are gaining momentum (they are). But what we hadn't done was ask our customers a few simple questions to help us better understand their e-reading habits. We addressed those habits in a recent survey. Here are the questions we asked:

  • If you purchase an ebook directly from us, which of the following is the primary device you will read it on? [Choices included laptop, desktop, iOS devices, Android devices, various Kindle models, and other ereaders/tablets.]
  • On which other devices do you plan to view your ebook?
  • If you purchase an ebook directly from us, which of the following is the primary format in which you plan to read the book? [Choices included PDF, EPUB, Mobi, APK, and Daisy formats.]
  • What other ebook formats, if any, do you plan to use?

We ran the survey for about a month and the answers might surprise you. Bear in mind that we realize our audience is unique. O'Reilly caters to technology professionals and enthusiasts. Our customers are also often among the earliest of early adopters.

So, what's the primary ereading device used by these early adopters and techno-enthusiasts? Their iPads. That's not shocking, but what is interesting is that only 25% of respondents said the iPad is their primary device. A whopping 46% said their laptop or desktop computer was their primary ereading device.

Despite all the fanfare about Kindles, iPads, tablets and E Ink devices, the bulk of our customers are still reading their ebooks on an old-fashioned laptop or desktop computer. It's also important to note that the most popular format isn't EPUB or Mobi. Approximately half the respondents said PDF is their primary format. When you think about it, this makes a lot of sense. Again, our audience is largely IT practitioners, coding or solving other problems in front of their laptops/desktops, so they like having the content on that same screen. And just about everyone has Adobe Acrobat on their computer, so the PDF format is immediately readable on most of the laptops/desktops our customers touch.

I've spoken with a number of publishers who rely almost exclusively on Amazon data and trends to figure out what their customers want. What a huge mistake. Even though your audience might be considerably different from O'Reilly's, how do you truly know what they want and need if you're relying on an intermediary (with an agenda) to tell you? Your hidden trend might not have anything to do with devices or formats but rather reader/app features or content delivery. If you don't take the time to build a direct channel, you may never know the answers. In fact, without a direct channel, you might not even know the questions that need to be asked.

Joe Wikert (@joewikert) tweeted select stats and findings from O'Reilly's ereader survey.


Mini TOC Chicago — Being held April 9, Mini TOC Chicago is a one-day event focusing on Chicago's thriving publishing, tech, and bookish-arts community.

Register to attend Mini TOC Chicago


January 23 2012

Survey results: How businesses are adopting and dealing with data

On December 7, 2011, we held our fifth Strata Online Conference. This series of free web events brings together analysts, innovators and researchers from a variety of fields. Each conference, we look at a particular facet of the move to big data — from personal analytics, to disruptive startups, to enterprise adoption.

This time, we focused on how businesses are going to embrace big data, and where the challenges lie. It was a perfect opportunity to survey the attendees and get a glimpse into enterprise adoption of big data. Out of the roughly 350 attendees, approximately 100 agreed to give us their feedback on a number of questions we asked. Here are the results.

Some basic facts

While the attendees worked for a mix of commercial, educational, government, and non-profit organizations, the vast majority (82%) worked for commercial, for-profit companies.

What kind of organization do you work for?

Most of the attendees' organizations were also fairly large: more than half of them had at least 500 co-workers, and 22% had more than 10,000.

How big is your organization?

We used this demographic information to segment and better analyze the other three questions we asked.

Big data adoption and challenges

We then asked attendees about their journey to big data. Fewer than 20% of them already had a big data solution in place — which we clarified to mean some kind of massive-scale, sharded, NoSQL, parallel data query system that may employ interactivity and machine-assisted data exploration. More than a quarter said they had no plans at this time.

How soon do you expect to implement a big data solution?

While it's relatively early days for adoption, more than 60% of attendees said they were in the process of gathering information on big data and what it meant to them. That result is skewed, of course: we're selecting an audience that wants to be an audience. Nevertheless, the volume of attendees and their feedback suggests that deployment is ramping up: if you're a big data vendor, this is the time to be fighting for mindshare.

What's the biggest challenge you see with big data?

When it comes to actually deploying big data, companies have plenty of challenges. The big ones seem to be:

  • Data privacy and governance.
  • Defining what big data actually is.
  • Integrating big data with legacy systems.
  • A lack of big data skills.
  • The cost of tools.

Analyzing a bit further

These results might be informative, but what we really want to know is how they correlate. After all, Strata is a data conference: we'd be remiss if we didn't crunch things a bit!

First, we wondered whether there's a relationship between the size of a company and the kinds of problems it's experiencing with big data.

Obstacles by company size

Our results suggest that governance and skill shortages are problems for larger companies, and that smaller businesses worry much less about data privacy and integrating legacy systems. Cost concerns come largely from mid-sized businesses.

Then we wondered whether adoption is tied to company size.

Big data adoption progress by company size

Among our attendees, smaller firms were ahead of the game: none of the companies larger than 500 employees said they had big data in place today.

We also found that educational, government, and NGO respondents didn't list cost as a top concern, suggesting that they may have a tolerance for open-source or home-grown approaches.

Obstacles by company type

Of course, the number of responses from these segments isn't statistically significant, but it warrants further study, particularly for commercial offerings trying to sell outside the for-profit world.

Finally, we wondered whether the things a company worries about change as it goes from "just browsing" to "trying to build."

Obstacles by time to implement

Concerns do seem to shift over the course of adoption and maturity. Early on, companies struggle to define what big data is and worry about staffing. As they get closer to implementation, their attention shifts to legacy system integration. Once they have a system, talent shortages and a variety of other, more specific concerns emerge.

While not a hard-core study — respondents weren't randomly selected, the number of responses within some segments isn't statistically significant, and so on — this feedback does suggest that there's a large demand for clear information on what big data is and how it'll change business, and that as enterprises move to adopt these technologies they'll face integration headaches and staffing issues.

The next free Strata Online Conference will be held on January 25. We'll be taking a look at what's in store for the upcoming Strata Conference (Feb. 28-March 1 in Santa Clara, Calif.).

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


January 19 2012

Strata Week: A home for negative and null results

Here are a few of the data stories that caught my attention this week:

Figshare sees the upside of negative results

Science data-sharing site Figshare relaunched its website this week, adding several new features. Figshare lets researchers publish all of their data online, including negative and null results.

Using the site, researchers can now upload and publish all file formats, including videos and datasets that are often deemed "supplemental materials" or excluded from current publishing models. This is part of a larger "open science" effort. According to Figshare:

"... by opening up the peer review process, researchers can easily publish null results, avoiding the file drawer effect and helping to make scientific research more efficient. Figshare uses creative commons licensing to allow frictionless sharing of research data whilst allowing users to maintain their ownership."

As the startup argues: "Unless we as scientists publish all of our data, we will never achieve access to the sum of all scientific knowledge."


Accel's $100 million data fund makes its first ($52.5 million) investment

Late last year, the investment firm Accel Partners announced a new $100 Million Big Data Fund, with a promise to invest in big data startups. This year, the first investment from that fund was revealed, with a whopping $52.5 million going to Code 42.

Founded in 2001, Code 42 is the creator of the backup software CrashPlan, and the company describes itself as building "high-performance hardware and easy-to-use software solutions that protect the world's data."

Describing the investment, GigaOm's Stacey Higginbotham writes:

"With the growth in mobile devices and the data stored on corporate and consumer networks that is moving not only from device to server, but device to device, [CEO Matthew] Dornquast realized Code 42's software could become more than just a backup and sharing service, but a way for corporations to understand what data and how data was moving between employees and the devices they use."

Higginbotham also cites Accel Partners' Ping Li, who notes that further investments from its Big Data Fund are unlikely to be so sizable.

LinkedIn open sources DataFu

LinkedIn has been a heavy user of Apache Pig for performing analysis with Hadoop on projects such as its People You May Know tool. For more advanced tasks like these, Pig supports User Defined Functions (UDFs), which allow the integration of custom code into scripts.

This week, LinkedIn announced the release of DataFu, the consolidation of its UDFs into a single, general-purpose library. DataFu enables users to "run PageRank on a large number of independent graphs, perform set operations such as intersect and union, compute the haversine distance between two points on the globe," and more.
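The haversine distance mentioned there has a compact closed form. As a rough illustration in plain Python (not DataFu's actual Pig UDF), the great-circle calculation looks like this:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

Distances come out in kilometers; swapping in 3959 for the radius gives miles.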

LinkedIn is making DataFu available on GitHub under the Apache 2.0 license.

Got data news?

Feel free to email me.


January 04 2012

The feedback economy

Military strategist John Boyd spent a lot of time understanding how to win battles. Building on his experience as a fighter pilot, he broke down the process of observing and reacting into something called an Observe, Orient, Decide, and Act (OODA) loop. Combat, he realized, consisted of observing your circumstances, orienting yourself to your enemy's way of thinking and your environment, deciding on a course of action, and then acting on it.

The Observe, Orient, Decide, and Act (OODA) loop.

The most important part of this loop isn't included in the OODA acronym, however. It's the fact that it's a loop. The results of earlier actions feed back into later, hopefully wiser, ones. Over time, the fighter "gets inside" their opponent's loop, outsmarting and outmaneuvering them. The system learns.

Boyd's genius was to realize that winning requires two things: being able to collect and analyze information better, and being able to act on that information faster, incorporating what's learned into the next iteration. Today, what Boyd learned in a cockpit applies to nearly everything we do.

Data-obese, digital-fast

In our always-on lives we're flooded with cheap, abundant information. We need to capture and analyze it well, separating digital wheat from digital chaff, identifying meaningful undercurrents while ignoring meaningless social flotsam. Clay Johnson argues that we need to go on an information diet, and makes a good case for conscious consumption. In an era of information obesity, we need to eat better. There's a reason they call it a feed, after all.

It's not just an overabundance of data that makes Boyd's insights vital. In the last 20 years, much of human interaction has shifted from atoms to bits. When interactions become digital, they become instantaneous, interactive, and easily copied. It's as easy to tell the world as to tell a friend, and a day's shopping is reduced to a few clicks.

The move from atoms to bits reduces the coefficient of friction of entire industries to zero. Teenagers shun e-mail as too slow, opting for instant messages. The digitization of our world means that trips around the OODA loop happen faster than ever, and continue to accelerate.

We're drowning in data. Bits are faster than atoms. Our jungle-surplus wetware can't keep up. At least, not without Boyd's help. In a society where every person, tethered to their smartphone, is both a sensor and an end node, we need better ways to observe and orient, whether we're at home or at work, solving the world's problems or planning a play date. And we need to be constantly deciding, acting, and experimenting, feeding what we learn back into future behavior.

We're entering a feedback economy.

The big data supply chain

Consider how a company collects, analyzes, and acts on data.

The big data supply chain.

Let's look at these components in order.

Data collection

The first step in a data supply chain is to get the data in the first place.

Information comes in from a variety of sources, both public and private. We're a promiscuous society online, and with the advent of low-cost data marketplaces, it's possible to get nearly any nugget of data relatively affordably. From social network sentiment, to weather reports, to economic indicators, public information is grist for the big data mill. Alongside this, we have organization-specific data such as retail traffic, call center volumes, product recalls, or customer loyalty indicators.

Collecting data legally is often harder than collecting it technically. Some data is heavily regulated — HIPAA governs healthcare, while PCI restricts financial transactions. In other cases, the act of combining data may be illegal because it generates personally identifiable information (PII). For example, courts have ruled differently on whether IP addresses are PII, and the California Supreme Court ruled that zip codes are. Navigating these regulations imposes some serious constraints on what can be collected and how it can be combined.

The era of ubiquitous computing means that everyone is a potential source of data, too. A modern smartphone can sense light, sound, motion, location, nearby networks and devices, and more, making it a perfect data collector. As consumers opt into loyalty programs and install applications, they become sensors that can feed the data supply chain.

In big data, the collection is often challenging because of the sheer volume of information, or the speed with which it arrives, both of which demand new approaches and architectures.

Ingesting and cleaning

Once the data is collected, it must be ingested. In traditional business intelligence (BI) parlance, this is known as Extract, Transform, and Load (ETL): the act of putting the right information into the correct tables of a database schema and manipulating certain fields to make them easier to work with.

One of the distinguishing characteristics of big data, however, is that the data is often unstructured. That means we don't know the inherent schema of the information before we start to analyze it. We may still transform the information — replacing an IP address with the name of a city, for example, or anonymizing certain fields with a one-way hash function — but we may hold onto the original data and only define its structure as we analyze it.


Hardware

The information we've ingested needs to be analyzed by people and machines. That means hardware, in the form of computing, storage, and networks. Big data doesn't change this, but it does change how it's used. Virtualization, for example, allows operators to spin up many machines temporarily, then destroy them once the processing is over.

Cloud computing is also a boon to big data. Paying by consumption destroys the barriers to entry that would prohibit many organizations from playing with large datasets, because there's no up-front investment. In many ways, big data gives clouds something to do.


Platforms

Where big data is new is in the platforms and frameworks we create to crunch large amounts of information quickly. One way to speed up data analysis is to break the data into chunks that can be analyzed in parallel. Another is to build a pipeline of processing steps, each optimized for a particular task.
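As a toy illustration of the chunk-and-parallelize approach (with Python threads standing in for a real cluster), here's a word count split across workers and merged at the end:

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    # Map step: each worker tallies its own slice of the lines.
    tally = {}
    for line in chunk:
        for word in line.split():
            tally[word] = tally.get(word, 0) + 1
    return tally

def parallel_word_count(lines, workers=4):
    # Split the data into roughly equal chunks, one per worker.
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_words, chunks)
    # Reduce step: merge the partial tallies into one result.
    total = {}
    for part in partials:
        for word, n in part.items():
            total[word] = total.get(word, 0) + n
    return total
```

The same split-tally-merge shape is what MapReduce-style frameworks run at scale, with machines in place of threads.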

Big data is often about fast results, rather than simply crunching a large amount of information. That's important for two reasons:

  1. Much of the big data work going on today is related to user interfaces and the web. Suggesting what books someone will enjoy, or delivering search results, or finding the best flight, requires an answer in the time it takes a page to load. The only way to accomplish this is to spread out the task, which is one of the reasons why Google has nearly a million servers.
  2. We analyze unstructured data iteratively. As we first explore a dataset, we don't know which dimensions matter. What if we segment by age? Filter by country? Sort by purchase price? Split the results by gender? This kind of "what if" analysis is exploratory in nature, and analysts are only as productive as their ability to explore freely. Big data may be big. But if it's not fast, it's unintelligible.
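That sort of exploratory "what if" segmentation is conceptually simple. A minimal sketch, using invented order records:

```python
from collections import defaultdict

def segment(rows, key, measure):
    """Group rows by one dimension and total a measure, e.g. revenue by country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)

orders = [
    {"country": "US", "gender": "F", "price": 30.0},
    {"country": "US", "gender": "M", "price": 10.0},
    {"country": "CA", "gender": "F", "price": 20.0},
]

# Each "what if" is just a different key: segment by country, then by gender.
by_country = segment(orders, "country", "price")
by_gender = segment(orders, "gender", "price")
```

The analysis itself is trivial; the hard part at big data scale is making each such question come back fast enough to ask the next one.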

Much of the hype around big data companies today is a result of the retooling of enterprise BI. For decades, companies have relied on structured relational databases and data warehouses, many of which can't handle the exploration, lack of structure, speed, and massive sizes of big data applications.

Machine learning

One way to think about big data is that it's "more data than you can go through by hand." For much of the data we want to analyze today, we need a machine's help.

Part of that help happens at ingestion. For example, natural language processing tries to read unstructured text and deduce what it means: Was this Twitter user happy or sad? Is this call center recording good, or was the customer angry?

Machine learning is important elsewhere in the data supply chain. When we analyze information, we're trying to find signal within the noise, to discern patterns. Humans can't find signal well by themselves. Just as astronomers use algorithms to scan the night sky for signals, then verify any promising anomalies themselves, so too can data analysts use machines to find interesting dimensions, groupings, or patterns within the data. Machines can work at a lower signal-to-noise ratio than people.
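The astronomer's workflow (machines flag candidate anomalies, humans verify them) can be sketched with a simple standard-deviation filter:

```python
import statistics

def flag_anomalies(readings, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean.
    The machine narrows millions of points down to a shortlist; a human
    then decides whether each flagged point is real signal or noise."""
    mean = statistics.mean(readings)
    stdev = statistics.pstdev(readings)
    if stdev == 0:
        return []  # no variation, nothing to flag
    return [x for x in readings if abs(x - mean) / stdev > threshold]
```

Real systems use far more sophisticated models, but the division of labor is the same: the algorithm filters, the analyst verifies.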

Human exploration

While machine learning is an important tool to the data analyst, there's no substitute for human eyes and ears. Displaying the data in human-readable form is hard work, stretching the limits of multi-dimensional visualization. While most analysts work with spreadsheets or simple query languages today, that's changing.

Creve Maples, an early advocate of better computer interaction, designs systems that take dozens of independent data sources and display them in navigable 3D environments, complete with sound and other cues. Maples' studies show that when we feed an analyst data in this way, they can often find answers in minutes instead of months.

This kind of interactivity requires the speed and parallelism explained above, as well as new interfaces and multi-sensory environments that allow an analyst to work alongside the machine, immersed in the data.


Storage

Big data takes a lot of storage. In addition to the actual information in its raw form, there's the transformed information; the virtual machines used to crunch it; the schemas and tables resulting from analysis; and the many formats that legacy tools require so they can work alongside new technology. Often, storage is a combination of cloud and on-premise storage, using traditional flat-file and relational databases alongside more recent, post-SQL storage systems.

During and after analysis, the big data supply chain needs a warehouse. Comparing year-on-year progress or changes over time means we have to keep copies of everything, along with the algorithms and queries with which we analyzed it.

Sharing and acting

All of this analysis isn't much good if we can't act on it. As with collection, this isn't simply a technical matter — it involves legislation, organizational politics, and a willingness to experiment. The data might be shared openly with the world, or closely guarded.

The best companies tie big data results into everything from hiring and firing decisions, to strategic planning, to market positioning. While it's easy to buy into big data technology, it's far harder to shift an organization's culture. In many ways, big data adoption isn't a hardware retirement issue, it's an employee retirement one.

We've seen similar resistance to change each time there's a big shift in information technology. Mainframes, client-server computing, packet-based networks, and the web all had their detractors. A NASA study into the failure of Ada concluded that proponents had over-promised, and that the language lacked a supporting ecosystem to help it flourish. Big data, and its close cousin, cloud computing, are likely to encounter similar obstacles.

A big data mindset is one of experimentation, of taking measured risks and assessing their impact quickly. It's similar to the Lean Startup movement, which advocates fast, iterative learning and tight links to customers. But while a small startup can be lean because it's nascent and close to its market, a big organization needs big data and an OODA loop to react well and iterate fast.

The big data supply chain is the organizational OODA loop. It's the big business answer to the lean startup.

Measuring and collecting feedback

Just as John Boyd's OODA loop is mostly about the loop, so big data is mostly about feedback. Simply analyzing information isn't particularly useful. To work, the organization has to choose a course of action from the results, then observe what happens and use that information to collect new data or analyze things in a different way. It's a process of continuous optimization that affects every facet of a business.

Replacing everything with data

Software is eating the world. Verticals like publishing, music, real estate and banking once had strong barriers to entry. Now they've been entirely disrupted by the elimination of middlemen. The last film projector rolled off the line in 2011: movies are now digital from camera to projector. The Post Office stumbles because nobody writes letters, even as Federal Express becomes the planet's supply chain.

Companies that get themselves on a feedback footing will dominate their industries, building better things faster for less money. Those that don't are already the walking dead, and will soon be little more than case studies and colorful anecdotes. Big data, new interfaces, and ubiquitous computing are tectonic shifts in the way we live and work.

A feedback economy

Big data, continuous optimization, and replacing everything with data pave the way for something far larger, and far more important, than simple business efficiency. They usher in a new era for humanity, with all its warts and glory. They herald the arrival of the feedback economy.

The efficiencies and optimizations that come from constant, iterative feedback will soon become the norm for businesses and governments. We're moving beyond an information economy. Information on its own isn't an advantage, anyway. Instead, this is the era of the feedback economy, and Boyd is, in many ways, the first feedback economist.




December 15 2011

Where is the OkCupid for elections?

To date, we've generally been more adept at collecting and storing data than at making sense of it. The companies, individuals and governments that become the most adept at data analysis are doing more than finding the signal in the noise: they are creating a strategic capability. Sometimes, the data comes from unexpected directions. For instance, OkCupid's approach to dating with data has earned it millions of users. In the process, OkCupid has gained great insight into the dynamics of dating in the 21st century, which it then shared on its blog.

Based upon that success, I wondered aloud at this year's Newsfoo whether a similar data-driven web app could be built to help citizens match themselves up with candidates.

After Tim tweeted the observation, I quickly learned two things:

  1. Albert Sun, Daniel Bachhuber, Ashwin Shandilya and Jay Zalowitz had built exactly that app at the 2011 Times Open Hack Day on the day I posed the question. OkCandidate is a web app that matches up a citizen with a Republican presidential candidate. (There's no comparable matching engine for Barack Obama, perhaps given that Democrats expect that the current incumbent of the White House will be the Democratic Party's nominee in 2012.) OkCandidate presents a straightforward series of questions about a wide range of core foreign and domestic issues with ratings to allow the user to rank the importance of agreeing with a given candidate. The app is open source, so if you want to try to improve the code, click on over to OkCandidate on GitHub.
  2. ElectNext, a Philadelphia-based startup, has focused on solving this problem. The "eHarmony for voters," as TechCrunch describes it, aims to match you to your candidate. I also learned that ElectNext won the Judges' Choice Award at the 2011 Web 2.0 Expo/NY Startup Showcase. In the video below, Joanne Wilson and Mo Koyfman discuss the startup from a venture capitalist's perspective.
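Under the hood, a matching engine like these presumably computes some weighted agreement score between a voter's ranked positions and each candidate's. A hypothetical sketch (the field names and scales are invented, not either app's actual model):

```python
def match_score(voter, candidate):
    """Weighted agreement between a voter and a candidate.

    Positions are per-issue stances on a -1 (oppose) to +1 (support) scale;
    weights are the voter's importance ratings for each issue.
    Returns a score from 0.0 (total disagreement) to 1.0 (perfect match)."""
    total = weighted = 0.0
    for issue, stance in voter["positions"].items():
        weight = voter["weights"].get(issue, 1.0)
        # Agreement is 1 when stances coincide, 0 when they're at opposite poles.
        agreement = 1.0 - abs(stance - candidate["positions"].get(issue, 0.0)) / 2.0
        weighted += weight * agreement
        total += weight
    return weighted / total if total else 0.0
```

Ranking candidates is then just sorting by this score, and the per-issue agreements themselves become the aggregate dataset the next section discusses.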

The politics of big data

Creating a better issue-matching engine for voters and candidates is a genuinely useful civic function. The not-so-hidden opportunity here, however, may be to gather a rich dataset from those choices in precisely the same way that OkCupid has done for dating. That's clearly part of the mindset here: "The data on individual users we don't share with anyone," ElectNext founder Keya Danenbaum told Fast Company. "But the way we foresee using all this information we're collecting is ... eventually to aggregate that and say something really interesting in a poll type of report."

How news organizations and campaigns alike collect, store and analyze data is going to matter much more. Close watchers of the intersection of politics and technology already think the Obama campaign's data crunching may help the president win re-election. As Personal Democracy Media co-founder Micah Sifry put it back in April, "it's the data, stupid."

Big data is "powering the race for the White House," wrote Patrick Ruffini, president of Engage, an interactive agency in D.C.:

The hottest job in today's Presidential campaigns is the Data Mining Scientist — whose job it is to sort through terabytes of data and billions of behaviors tracked in voter files, consumer databases, and site logs. They'll use the numbers to uncover hidden patterns that predict how you'll vote, if you'll pony up with a donation, and if you'll influence your friends to support a candidate.

Alistair Croll, the co-chair of the Strata Conference, thinks it's a strategic capability. "After Eisenhower, you couldn't win an election without radio," he told me at Strata in California in February. "After JFK, you couldn't win an election without television. After Obama, you couldn't win an election without social networking. I predict that in 2012, you won't be able to win an election without big data."



November 30 2011

Big data goes to work

Companies that are slow to adopt data-driven practices don't need to worry about long-term plans — they'll be disrupted out of existence before those deadlines arrive. And even if your business is on the data bandwagon, you shouldn't get too comfortable. Shifts in consumer tolerances and expectations are quickly shaping how businesses apply big data.

Alistair Croll, Strata online program chair, explores these shifts and other data developments in the following interview. Many of these same topics will be discussed at "Moving to Big Data," a free Strata Online Conference being held Dec. 7.

How are consumer expectations about data influencing enterprises?

Alistair Croll: There are two dimensions. First, consumer tolerance for sharing data has gone way up. I think there's a general realization that shared information isn't always bad: we can use it to understand trends or fight diseases. Recent rulings by the Supreme Court and legislation like the Genetic Information Nondiscrimination Act (GINA) offer some degree of protection. This means it's easier for companies to learn about their customers.

Second, consumers expect that if a company knows about them, it will treat them personally. We're incensed when a vendor that claims to have a personal connection with us treats us anonymously. The pact of sharing is that we demand personalization in return. That means marketers are scrambling to turn what they know about their customers into changes in how they interact with them.

What's the relationship between traditional business intelligence (BI) and big data? Are they adversaries?

Alistair Croll: Big data is a successor to traditional BI, and in that respect, there's bound to be some bloodshed. But both BI and big data are trying to do the same thing: answer questions. If big data gets businesses asking better questions, it's good for everyone.

Big data is different from BI in three main ways:

  1. It's about more data than BI; sheer volume is the traditional definition of big data.
  2. It's about faster data than BI, which means exploration and interactivity, and in some cases delivering results in less time than it takes to load a web page.
  3. It's about unstructured data, which we only decide how to use after we've collected it, relying on algorithms and interactivity to find the patterns it contains.
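A toy illustration of that third point (the log format and field names here are invented for the example): the raw lines are collected before anyone knows how they'll be used, and structure is imposed only later, when a question arises.

```python
import re
from collections import Counter

# Unstructured lines, collected as-is before anyone decided how to use them.
raw_logs = [
    "2011-11-30 09:14:02 user=ana action=search q=hadoop",
    "2011-11-30 09:14:09 user=bob action=purchase sku=1234",
    "2011-11-30 09:15:44 user=ana action=purchase sku=5678",
]

# Only now do we decide what pattern to look for: purchases per user.
purchases = Counter(
    m.group(1)
    for line in raw_logs
    if (m := re.search(r"user=(\w+) action=purchase", line))
)

print(purchases)  # Counter({'ana': 1, 'bob': 1})
```

The schema-on-read step (the regular expression) can be rewritten at any time without re-collecting anything, which is exactly what a schema-first warehouse can't do.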

When traditional BI bumps up against the edges of big, fast, or unstructured, that's when big data takes over. So, it's likely that in a few years we'll ask a business question, and the tools themselves will decide if they can use traditional relational databases and data warehouses or if they should send the task to a different architecture based on its processing requirements.

What's obvious to anyone on either side of the BI/big data fence is that the importance of asking the right questions — and the business value of doing so — has gone way, way up.

How can businesses unlock their data? What's involved in that process?

Alistair Croll: The first step is to ask the right questions. Before, a leader was someone who could convince people to act in the absence of clear evidence. Today, it's someone who knows what questions to ask.

Acting in the absence of clear evidence mattered because we lived in a world of risk and reward. Uncertainty meant we didn't know which course of action to take — and that if we waited until it was obvious, all the profit would have evaporated.

But today, everyone has access to more data than they can handle. There are simply too many possible actions, so the spoils go to the organization that can choose among them. This is similar to the open-source movement: Goldcorp took its geological data on gold deposits — considered the "crown jewels" in the mining industry — and shared it with the world, creating a contest to find rich veins to mine. Today, they're one of the most successful mining companies in the world. That comes from sharing and opening up data, not hoarding it.

Finally, the value often isn't in the data itself; it's in building an organization that can act on it swiftly. Military strategist John Boyd developed the observe, orient, decide and act (OODA) loop, which is a cycle of collecting information and acting that fighter pilots could use to outwit their opponents. Pilots talk of "getting inside" the enemy's OODA loop; companies need to do the same thing.

So, businesses need to do three things:

  1. Learn how to ask the right questions instead of leading by gut feel and politics.
  2. Change how they think about data, opening it up to make the best use of it when appropriate and realizing that there's a risk in being too private.
  3. Tune the organization to iterate more quickly than competitors by collecting, interpreting, and testing information on its markets and customers.

Moving to Big Data: Free Strata Online Conference — In this free online event, being held Dec. 7, 2011, at 9AM Pacific, we'll look at how big data stacks and analytical approaches are gradually finding their way into organizations as well as the roadblocks that can thwart efforts to become more data driven. (This Strata Online Conference is sponsored by Microsoft.)

Register to attend this free Strata Online Conference

What are the most common data roadblocks in companies?

Alistair Croll: Everyone I talk to says privacy, governance, and compliance. But if you really dig in, it's culture. Employees like being smart, or convincing, or compelling. They've learned soft skills like negotiation, instinct, and so on.

Until now, that's been enough to win friends and influence people. But the harsh light of data threatens existing hierarchies. When you have numbers and tests, you don't need arguments. All those gut instincts are merely hypotheses ripe for testing, and that means the biggest obstacle is actually company culture.

Are most businesses still in the data acquisition phase? Or are you seeing companies shift into data application?

Alistair Croll: These aren't really phases. Companies have a cycle — call it a data supply chain — that consists of collection, interpretation, sharing, and measuring. They've been doing it for structured data for decades: sales by quarter, by region, by product. But they're now collecting more data, without being sure how they'll use it.
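The structured end of that supply chain is simple enough to sketch. The records and figures below are hypothetical; any reporting tool computes the equivalent roll-ups.

```python
from collections import defaultdict

# Hypothetical sales records: (quarter, region, product, amount).
sales = [
    ("Q1", "East", "widgets", 120.0),
    ("Q1", "West", "widgets", 80.0),
    ("Q2", "East", "gadgets", 200.0),
    ("Q2", "East", "widgets", 50.0),
]

# Classic BI roll-ups: sales by quarter, by region, by product.
by_quarter = defaultdict(float)
by_region = defaultdict(float)
by_product = defaultdict(float)
for quarter, region, product, amount in sales:
    by_quarter[quarter] += amount
    by_region[region] += amount
    by_product[product] += amount

print(dict(by_quarter))  # {'Q1': 200.0, 'Q2': 250.0}
```

The point of the contrast in the interview is that these aggregates only work because the schema was fixed up front; the newly collected data doesn't arrive with one.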

We're also seeing them asking questions that can't be answered by traditional means, either because there's too much data to analyze in a timely manner, or because the tools they have can't answer the questions they have. That's bringing them to platforms like Hadoop.
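The programming model behind Hadoop can be sketched in a few lines: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The word-count sketch below runs over an in-memory list rather than a cluster, and none of the names are Hadoop's API; it only shows the shape of the computation.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in the record.
    for word in record.split():
        yield word, 1

def reduce_phase(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

records = ["big data", "fast data", "big questions"]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(counts)  # {'big': 2, 'data': 2, 'fast': 1, 'questions': 1}
```

Because map and reduce touch each record and each key independently, the same code pattern parallelizes across machines, which is what makes "too much data to analyze in a timely manner" tractable.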

One of the catalysts for this adoption has been web analytics, which is, for many firms, their first taste of big data. And now, marketers are asking, "If I have this kind of insight into my online channels, why can't I get it elsewhere?" Tools once used for loyalty programs and database marketing are being repurposed for campaign management and customer insight.

How will big data shape businesses over the next few years?

Alistair Croll: I like to ask people, "Why do you know more about your friends' vacations (through Facebook or Twitter) than about whether you're going to make your numbers this quarter or where your trucks are?" The consumer web is writing big data checks that enterprise BI simply can't cash.

Where I think we'll see real disruption and adoption is in horizontal applications. The big data limelight is focused on vertical stuff today — genomics, algorithmic trading, and so on. But when it's used to detect employee fraud or to hire and fire the right people, or to optimize a supply chain, then the benefits will be irresistible.

In the last decade, web analytics, CRM, and other applications have found their way into enterprise IT through the side door, in spite of the CIO's allergies to outside tools. These applications are often built on "big data," scale-out architectures.

Which companies are doing data right?

Alistair Croll: Unfortunately, the easy answer is "the new ones." Despite having all the data, Blockbuster lost to Netflix; Barnes & Noble lost to Amazon. It may be that, just like the switch from circuits to packets or from procedural to object-oriented programming, running a data-driven business requires a fundamentally different skill set.

Big firms need to realize that they're sitting on a massive amount of information but are unable to act on it unless they loosen up and start asking the right questions. And they need to realize that big data is a massive disintermediator, from which no industry is safe.

This interview was edited and condensed.


November 04 2011

Top Stories: October 31-November 4, 2011

Here's a look at the top stories published across O'Reilly sites this week.

How I automated my writing career
You scale content businesses by increasing the number of people who create the content ... or so conventional wisdom says. Learn how a former author is using software to simulate and expand human-quality writing.

What does privacy mean in an age of big data?
Ironclad digital privacy isn't realistic, argues "Privacy and Big Data" co-author Terence Craig. What we need instead are laws and commitments founded on transparency.

If your data practices were made public, would you be nervous?
Solon Barocas, a doctoral student at New York University, discusses consumer perceptions of data mining and how companies and data scientists can shape data mining's reputation.

Five ways to improve publishing conferences
Keynotes and panel discussions may not be the best way to program conferences. What if organizers instead structured events more like a great curriculum?

Anthropology extracts the true nature of tech
Genevieve Bell, director of interaction and experience research at Intel, talks about how anthropology can inform business decisions and product design.

Tools of Change for Publishing, being held February 13-15 in New York, is where the publishing and tech industries converge. Register to attend TOC 2012.

October 12 2011

Data in the HR department

Human resources departments are already familiar with data and analytics. These departments track who's hired, who's promoted, who departs, and so on. But as organizations become more data driven, new opportunities emerge for HR to put data to use.

In a recent interview, Kathryn Dekas, people analytics manager at Google, discussed the relationship between data and HR. Highlights from the interview (below) included:

  • HR data clearly benefits a company, but Dekas said it can also help employees. "If you know the company [you work for] is using data to make important decisions, it provides an additional layer of trust," she said. "Things are being done based on objective measures over someone's intuition." Moreover, if employees have access to their own HR data, they can then use this information to take ownership of their positions. [Discussed at the 00:42 mark.]
  • Dashboards are often a useful tool, but Dekas said a form of data blindness can creep in after repeated exposure to the same metrics. "What you really need is to disrupt the more typical feedback mechanisms with insights based on questions that are relevant at that moment," Dekas said. "It's easy to become comfortable in sending out metrics regularly, but what you want to do is think about what's timely and relevant." [Discussed at 2:02.]
  • Is there a connection between HR data and the sensor-driven Quantified Self movement? While Dekas said that placing sensors on employees has a "creepiness factor" and isn't likely to happen anytime soon, she did say there are broader ways to view and define workplace sensors, including employee surveys and other feedback mechanisms. She also drew an important distinction between the default state of personal tracking — where you share only if you want to — and the HR environment, where some amount of employee data is shared within an organization. [Discussed at 3:25.]

The full interview is available in the following video:

