
March 22 2012

Strata Week: Machine learning vs domain expertise

Here are a few of the data stories that caught my attention this week:

Debating the future of subject area expertise

Data Science Debate panel at Strata CA 12
The "Data Science Debate" panel at Strata California 2012. Watch the debate.

The Oxford-style debate at Strata continues to be one of the most-talked-about events from the conference. This week, it's O'Reilly's Mike Loukides who weighs in with his thoughts on the debate, which had the motion "In data science, domain expertise is more important than machine learning skill." (For those who weren't there, the machine learning side "won." See Mike Driscoll's summary and full video from the debate.)

Loukides moves from the unreasonable effectiveness of data to examine the "unreasonable necessity of subject experts." He writes that:

"Whether you hire subject experts, grow your own, or outsource the problem through the application, data only becomes 'unreasonably effective' through the conversation that takes place after the numbers have been crunched ... We can only take our inexplicable results at face value if we're just going to use them and put them away. Nobody uses data that way. To push through to the next, even more interesting result, we need to understand what our results mean; our second- and third-order results will only be useful when we understand the foundations on which they're based. And that's the real value of a subject matter expert: not just asking the right questions, but understanding the results and finding the story that the data wants to tell. Results are good, but we can't forget that data is ultimately about insight, and insight is inextricably tied to the stories we build from the data. And those stories are going to be ever more essential as we use data to build increasingly complex systems."

Microsoft hires former Yahoo chief scientist

Microsoft has hired Raghu Ramakrishnan as a technical fellow for its Server and Tools Business (STB), reports ZDNet's Mary Jo Foley. According to his new company bio, Ramakrishnan's work will involve "big data and integration between STB's cloud offerings and the Online Services Division's platform assets."

Ramakrishnan comes to Microsoft from Yahoo, where he's been the chief scientist for three divisions — Audience, Cloud Platforms and Search. As Foley notes, Ramakrishnan's move is another indication that Microsoft is serious about "playing up its big data assets." Strata chair Edd Dumbill examined Microsoft's big data strategy earlier this year, noting in particular its work on a Hadoop distribution for Windows server and Azure.

Analyzing the value of social media data

How much is your data worth? The Atlantic's Alexis Madrigal does a little napkin math based on figures from the Internet Advertising Bureau to come up with a broad and ambiguous range between half a cent and $1,200 — depending on how you decide to make the calculation, of course.

In an effort to make those measurements easier and more useful, Google unveiled some additional reports as part of its Analytics product this week. It's a move Google says will help marketers:

"... identify the full value of traffic coming from social sites and measure how they lead to direct conversions or assist in future conversions; understand social activities happening both on and off of your site to help you optimize user engagement and increase social key performance indicators (KPIs); and make better, more efficient data-driven decisions in your social media marketing programs."

Engagement and conversion metrics for each social network will now be trackable through Google Analytics. Partners for this new Social Data Hub include Disqus, Echo, Reddit, Diigo, and Digg, among others.


Got data news?

Feel free to email me.


November 03 2011

The number one trait you want in a data scientist

"Data scientist" is an on-the-rise job title, but what are the skills that make a good one? And how can both data scientists and the companies they work for make sure data-driven insights become actionable?

In a recent interview, DJ Patil (@dpatil), formerly the chief scientist at LinkedIn and now the data scientist in residence at Greylock Partners, discussed common data scientist traits and the challenges that those in the profession face getting their work onto company roadmaps.

Highlights from the interview (below) included:

  • What makes a good data scientist? According to Patil, the number one trait of data scientists is "a passion for really getting to an answer." This does mean, Patil said, that personality might trump skills. Pointing to what he calls "data jujitsu" — the art of turning data into products — he noted that some people can approach a problem "very heavily and very aggressively" using all sorts of computing tools. "But one data scientist who's clever can get results far faster. And typically in a business situation, that's going to have better payoff." Patil pointed to a site like Kaggle, where people compete to solve data problems, and noted that despite the number of data scientists there using machine learning and artificial intelligence, some of them are "getting beat by people who just have good, interesting insights." [Discussed at the 1:34 mark.]
  • Despite the "data smarts and street smarts" that Patil sees as key to data science, data scientists sometimes struggle to get companies to pay attention to the insights data science can provide. The good news is that Patil anticipates this attention issue will fade in the future. Once organizations recognize the importance of data, they'll identify and handle data in better ways. Furthermore, we'll see "a new generation of designers, product managers and GMs who are also data scientists and not just former engineers." [Discussed at 4:09.]

The full interview is available in the following video:



September 16 2011

Building data science teams

Starting in 2008, Jeff Hammerbacher (@hackingdata) and I sat down to share our experiences building the data and analytics groups at Facebook and LinkedIn. In many ways, that meeting was the start of data science as a distinct professional specialization (see the "What makes a data scientist" section of this report for the story on how we came up with the title "Data Scientist"). Since then, data science has taken on a life of its own. The hugely positive response to "What Is Data Science?," a great introduction to the meaning of data science in today's world, showed that we were at the start of a movement. There are now regular meetups, well-established startups, and even college curricula focusing on data science. As McKinsey's big data research report and LinkedIn's data indicate, data science talent is in high demand.

This increase in the demand for data scientists has been driven by the success of the major Internet companies. Google, Facebook, LinkedIn, and Amazon have all made their marks by using data creatively: not just warehousing data, but turning it into something of value. Whether that value is a search result, a targeted advertisement, or a list of possible acquaintances, data science is producing products that people want and value. And it's not just Internet companies: Walmart doesn't produce "data products" as such, but they're well known for using data to optimize every aspect of their retail operations.

Given how important data science has grown, it's important to think about what data scientists add to an organization, how they fit in, and how to hire and build effective data science teams.

Analytics and Data Science Job Growth
Courtesy LinkedIn Corp.

Being data driven

Everyone wants to build a data-driven organization. It's a popular phrase and there are plenty of books, journals, and technical blogs on the topic. But what does it really mean to be "data driven"? My definition is:

A data-driven organization acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.

There are many ways to assess whether an organization is data driven. Some like to talk about how much data they generate. Others like to talk about the sophistication of data they use, or the process of internalizing data. I prefer to start by highlighting organizations that use data effectively.

Ecommerce companies have a long history of using data to benefit their organizations. Any good salesman instinctively knows how to suggest further purchases to a customer. With "People who viewed this item also viewed ...," Amazon moved this technique online. This simple implementation of collaborative filtering is one of their most used features; it is a powerful mechanism for serendipity outside of traditional search. This feature has become so popular that there are now variants such as "People who viewed this item bought ... ." If a customer isn't quite satisfied with the product he's looking at, suggest something similar that might be more to his taste. The value to a master retailer is obvious: close the deal if at all possible, and instead of a single purchase, get customers to make two or more purchases by suggesting things they're likely to want. Amazon revolutionized electronic commerce by bringing these techniques online.
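The item-to-item collaborative filtering behind "People who viewed this item also viewed ..." can be sketched in a few lines. This is a minimal co-occurrence counter, not Amazon's actual system, and the sessions and item names below are invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical view sessions: each inner list is one visitor's viewed items.
sessions = [
    ["laptop", "mouse", "keyboard"],
    ["laptop", "mouse"],
    ["mouse", "keyboard"],
    ["laptop", "monitor"],
]

# Count how often each pair of items is viewed in the same session.
co_views = defaultdict(Counter)
for items in sessions:
    for a in items:
        for b in items:
            if a != b:
                co_views[a][b] += 1

def also_viewed(item, n=2):
    """Items most often viewed alongside `item`, ranked by co-view count."""
    return [other for other, _ in co_views[item].most_common(n)]

print(also_viewed("laptop"))  # "mouse" co-occurs most often with "laptop"
```

A production version would precompute these counts offline over billions of sessions and normalize for item popularity, but the underlying signal is the same pairwise co-occurrence.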

Data products are at the heart of social networks. After all, what is a social network if not a huge dataset of users with connections to each other, forming a graph? Perhaps the most important product for a social network is something to help users connect with others. Any new user needs to find friends, acquaintances, or contacts. It's not a good user experience to force users to search for their friends, which is often a surprisingly difficult task. At LinkedIn, we invented People You May Know (PYMK) to solve this problem. It's easy for software to predict that if James knows Mary, and Mary knows John Smith, then James may know John Smith. (Well, conceptually easy. Finding connections in graphs gets tough quickly as the endpoints get farther apart. But solving that problem is what data scientists are for.) But imagine searching for John Smith by name on a network with hundreds of millions of users!
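The friends-of-friends idea at the core of PYMK can be sketched directly: walk two hops out from a user and rank candidates by mutual connections. The tiny graph below is hypothetical; the hard part at LinkedIn's scale is doing this over hundreds of millions of members, not the logic itself:

```python
from collections import Counter

# Hypothetical connection graph: adjacency sets keyed by member name.
graph = {
    "James": {"Mary"},
    "Mary": {"James", "John"},
    "John": {"Mary", "Sara"},
    "Sara": {"John"},
}

def people_you_may_know(user):
    """Rank second-degree contacts by the number of mutual connections."""
    direct = graph[user]
    candidates = Counter()
    for friend in direct:
        for fof in graph[friend]:
            if fof != user and fof not in direct:
                candidates[fof] += 1  # one more shared connection
    return [name for name, _ in candidates.most_common()]

print(people_you_may_know("James"))  # James -> Mary -> John
```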

Although PYMK was novel at the time, it has become a critical part of every social network's offering. Facebook not only supports its own version of PYMK, they monitor the time it takes for users to acquire friends. Using sophisticated tracking and analysis technologies, they have identified the time and number of connections it takes to get a user to long-term engagement. If you connect with only a few friends, or add friends slowly, you won't stick around for long. By studying the activity levels that lead to commitment, they have designed the site to decrease the time it takes for new users to connect with the critical number of friends.

Netflix does something similar in their online movie business. When you sign up, they strongly encourage you to add movies to the queue of titles you intend to watch. Their data team has discovered that once you add more than a certain number of movies, the probability that you will be a long-term customer is significantly higher. With this data, Netflix can construct, test, and monitor product flows to maximize the number of new users who exceed the magic number and become long-term customers. They've built a highly optimized registration/trial service that leverages this information to engage the user quickly and efficiently.
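The Netflix analysis described above amounts to comparing retention on either side of a queue-size threshold. A toy version, with invented signup records standing in for real customer data:

```python
# Hypothetical signup records: (movies added to queue, still a customer at 6 months).
signups = [
    (1, False), (2, False), (3, False), (4, True),
    (5, True), (6, True), (8, True), (10, True),
]

def retention_rate(records):
    """Fraction of records where the customer was retained."""
    return sum(kept for _, kept in records) / len(records)

def split_by_threshold(records, threshold):
    """Compare retention for users above vs. at-or-below a queue-size threshold."""
    above = [r for r in records if r[0] > threshold]
    below = [r for r in records if r[0] <= threshold]
    return retention_rate(above), retention_rate(below)

above, below = split_by_threshold(signups, threshold=3)
print(above, below)  # retention is far higher above the threshold
```

In practice the threshold itself is found by sweeping candidate values over historical data and watching where the retention curve jumps.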

Netflix, LinkedIn, and Facebook aren't alone in using customer data to encourage long-term engagement — Zynga isn't just about games. Zynga constantly monitors who their users are and what they are doing, generating an incredible amount of data in the process. By analyzing how people interact with a game over time, they have identified tipping points that lead to a successful game. They know how the probability that users will become long-term changes based on the number of interactions they have with others, the number of buildings they build in the first n days, the number of mobsters they kill in the first m hours, etc. They have figured out the keys to the engagement challenge and have built their product to encourage users to reach those goals. Through continued testing and monitoring, they refined their understanding of these key metrics.


Google and Amazon pioneered the use of A/B testing to optimize the layout of a web page. For much of the web's history, web designers worked by intuition and instinct. There's nothing wrong with that, but if you make a change to a page, you owe it to yourself to ensure that the change is effective. Do you sell more product? How long does it take for users to find the result they're looking for? How many users give up and go to another site? These questions can only be answered by experimenting, collecting the data, and doing the analysis, all of which are second nature to a data-driven company.
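Results from such an experiment are typically checked with a two-proportion z-test: did the new page layout convert at a genuinely different rate, or is the gap noise? A self-contained sketch, with conversion counts invented for illustration:

```python
from math import sqrt, erf

def ab_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: does variant B convert differently from A?
    Returns the z statistic and a two-sided p-value (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)       # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: 120/2400 conversions on the control page,
# 165/2400 on the redesigned page.
z, p = ab_z_test(120, 2400, 165, 2400)
print(round(z, 2), round(p, 4))  # z well above 1.96, so significant at the 5% level
```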

Yahoo has made many important contributions to data science. After observing Google's use of MapReduce to analyze huge datasets, they realized that they needed similar tools for their own business. The result was Hadoop, now one of the most important tools in any data scientist's repertoire. Hadoop has since been commercialized by Cloudera, Hortonworks (a Yahoo spin-off), MapR, and several other companies. Yahoo didn't stop with Hadoop; they have observed the importance of streaming data, a workload that Hadoop doesn't handle well, and are working on an open source tool called S4 (still in the early stages) to handle streams effectively.

Payment services, such as PayPal, Visa, American Express, and Square, live and die by their abilities to stay one step ahead of the bad guys. To do so, they use sophisticated fraud detection systems to look for abnormal patterns in incoming data. These systems must be able to react in milliseconds, and their models need to be updated in real time as additional data becomes available. It amounts to looking for a needle in a haystack while the workers keep piling on more hay. We'll go into more details about fraud and security later in this article.

Google and other search engines constantly monitor search relevance metrics to identify areas where people are trying to game the system or where tuning is required to provide a better user experience. The challenge of moving and processing data on Google's scale is immense, perhaps larger than any other company today. To support this challenge, they have had to invent novel technical solutions that range from hardware (e.g., custom computers) to software (e.g., MapReduce) to algorithms (PageRank), much of which has now percolated into open source software projects.

I've found that the strongest data-driven organizations all live by the motto "if you can't measure it, you can't fix it" (a motto I learned from one of the best operations people I've worked with). This mindset gives you a fantastic ability to deliver value to your company by:

  • Instrumenting and collecting as much data as you can. Whether you're doing business intelligence or building products, if you don't collect the data, you can't use it.
  • Measuring in a proactive and timely way. Are your products and strategies succeeding? If you don't measure the results, how do you know?
  • Getting many people to look at data. Any problems that may be present will become obvious more quickly — "given enough eyeballs, all bugs are shallow."
  • Fostering increased curiosity about why the data has changed or is not changing. In a data-driven organization, everyone is thinking about the data.

It's easy to pretend that you're data driven. But if you get into the mindset to collect and measure everything you can, and think about what the data you've collected means, you'll be ahead of most of the organizations that claim to be data driven. And while I have a lot to say about professional data scientists later in this post, keep in mind that data isn't just for the professionals. Everyone should be looking at the data.

The roles of a data scientist

In every organization I've worked with or advised, I've always found that data scientists have an influence out of proportion to their numbers. The many roles that data scientists can play fall into the following domains.

Decision sciences and business intelligence

Data has long played a role in advising and assisting operational and strategic thinking. One critical aspect of decision-making support is defining, monitoring, and reporting on key metrics. While that may sound easy, there is a real art to defining metrics that help a business better understand its "levers and control knobs." Poorly-chosen metrics can lead to blind spots. Furthermore, metrics must always be used in context with each other. For example, when looking at percentages, it is still important to see the raw numbers. It is also essential that metrics evolve as the sophistication of the business increases. As an analogy, imagine a meteorologist who can only measure temperature. This person's forecast is always going to be of lower quality than the meteorologist who knows how to measure air pressure. And the meteorologist who knows how to use humidity will do even better, and so on.

Once metrics and reporting are established, the dissemination of data is essential. There's a wide array of tools for publishing data, ranging from simple spreadsheets and web forms, to more sophisticated business intelligence products. As tools get more sophisticated, they typically add the ability to annotate and manipulate (e.g., pivot with other data elements) to provide additional insights.

More sophisticated data-driven organizations thrive on the "democratization" of data. Data isn't just the property of an analytics group or senior management. Everyone should have access to as much data as legally possible. Facebook has been a pioneer in this area. They allow anyone to query the company's massive Hadoop-based data store using a language called Hive. This way, nearly anyone can create a personal dashboard by running scripts at regular intervals. Zynga has built something similar, using a completely different set of technologies. They have two copies of their data warehouses. One copy is used for operations where there are strict service-level agreements (SLA) in place to ensure reports and key metrics are always accessible. The other data store can be accessed by many people within the company, with the understanding that performance may not always be optimal. A more traditional model is used by eBay, which uses technologies like Teradata to create cubes of data for each team. These cubes act like self-contained datasets and data stores that the teams can interact with.

As organizations have become increasingly adept with reporting and analysis, there has been increased demand for strategic decision-making using data. We have been calling this new area "decision sciences." These teams delve into existing data sources and meld them with external data sources to understand the competitive landscape, prioritize strategy and tactics, and provide clarity about hypotheses that may arise during strategic planning. A decision sciences team might take on a problem, like which country to expand into next, or it might investigate whether a particular market is saturated. This analysis might, for example, require mixing census data with internal data and then building predictive models that can be tested against existing data or data that needs to be acquired.

One word of caution: people new to data science frequently look for a "silver bullet," some magic number around which they can build their entire system. If you find it, fantastic, but few are so lucky. The best organizations look for levers that they can lean on to maximize utility, and then move on to find additional levers that increase the value of their business.


Product and marketing analytics

Product analytics represents a relatively new use of data. Teams create applications that interact directly with customers, such as:

  • Products that provide highly personalized content (e.g., the ordering/ranking of information in a news feed).
  • Products that help drive the company's value proposition (e.g., "People You May Know" and other applications that suggest friends or other types of connections).
  • Products that introduce users to other products (e.g., "Groups You May Like," which funnels you into LinkedIn's Groups product area).
  • Products that prevent dead ends (e.g., collaborative filters that suggest further purchases, such as Amazon's "People who viewed this item also viewed ...").
  • Products that are stand-alone (e.g., news relevancy products like Google News, LinkedIn Today, etc.).

Given the rapidly decreasing cost of computation, it is easier than ever to use common algorithms and numerical techniques to test the effectiveness of these products.

Similar to product analytics, marketing analytics uses data to explain and showcase a service or product's value proposition. A great example of marketing analytics is OKCupid's blog, which uses internal and external data sources to discuss larger trends. For example, one well-known post correlates the number of sexual partners with smartphone brands. Do iPhone users have more fun? OKCupid knows. Another post studied what kinds of profile pictures are attractive, based on the number of new contacts they generated. In addition to a devoted following, these blog posts are regularly picked up by traditional media, and shared virally through social media channels. The result is a powerful marketing tactic that drives both new users and returning users. Other companies that have used data to drive blogging as a marketing strategy include Mint, LinkedIn, Facebook, and Uber.

Email has long been the basis for online communication with current and potential customers. Using analytics as a part of an email targeting strategy is not new, but powerful analytical technologies can help to create email marketing programs that provide rich content. For example, LinkedIn periodically sends customers updates about changes to their networks: new jobs, significant posts, new connections. This would be spam if it were just a LinkedIn advertisement. But it isn't — it's relevant information about people you already know. Similarly, Facebook uses email to encourage you to come back to the site if you have been inactive. Those emails highlight the activity of your most relevant friends. Since it is hard to delete an email that tells you what your friends are up to, it's extremely effective.

Fraud, abuse, risk and security

Online criminals don't want to be found. They try to hide in the data. There are several key components in the constantly evolving war between attackers and defenders: data collection, detection, mitigation, and forensics. The skills of data scientists are well suited to all of these components.

Any strategy for preventing and detecting fraud and abuse starts with data collection. Data collection is always a challenge, and it is tough to decide how much instrumentation is sufficient. Attackers are always looking to exploit the limitations of your data, but constraints such as cost and storage capacity mean that it's usually impossible to collect all the data you'd like. The ability to recognize which data needs to be collected is essential. There's an inevitable "if only" moment during an attack: "if only we had collected x and y, we'd be able to see what is going on."

Another aspect of incident response is the time required to process data. If an attack is evolving minute by minute, but your processing layer takes hours to analyze the data, you won't be able to respond effectively. Many organizations are finding that they need data scientists, along with sophisticated tooling, to process and analyze data quickly enough to act on it.

Once the attack is understood, the next phase is mitigation. Mitigation usually requires closing an exploit or developing a model that segments bad users from good users. Success in this area requires the ability to take existing data and transform it into new variables that can be acted upon. This is a subtle but critical point. As an example, consider IP addresses. Any logging infrastructure almost certainly collects the IP addresses that connect to your site. Addresses by themselves are of limited use. However, an IP address can be transformed into variables such as:

  • The number of bad actors seen from this address during some period of time.
  • The country from which the address originated, and other geographic information.
  • Whether the address is typical for this time of day.

From this data, we now have derived variables that can be built into a model for an actionable result. Domain experts who are data scientists understand how to make variables out of the data. And from those variables, you can build detectors to find the bad guys.
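Deriving such variables from a raw log is straightforward feature engineering. A sketch using an invented access log (a real system would also join against a geo-IP database for the country lookup, which is omitted here):

```python
from collections import Counter
from datetime import datetime

# Hypothetical access log: (ip, timestamp, flagged-as-bad in past incidents).
log = [
    ("203.0.113.9", datetime(2011, 9, 1, 3, 14), True),
    ("203.0.113.9", datetime(2011, 9, 1, 3, 20), True),
    ("203.0.113.9", datetime(2011, 9, 2, 15, 5), False),
    ("198.51.100.7", datetime(2011, 9, 1, 12, 0), False),
]

def derive_features(ip):
    """Turn a raw IP address into model-ready variables."""
    rows = [r for r in log if r[0] == ip]
    hours = Counter(ts.hour for _, ts, _ in rows)
    return {
        "bad_actor_count": sum(bad for _, _, bad in rows),  # prior abuse from this IP
        "request_count": len(rows),
        "typical_hour": hours.most_common(1)[0][0],         # usual time of day seen
    }

print(derive_features("203.0.113.9"))
```

Each derived value can then feed a fraud model directly; the raw address, by itself, could not.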

Finally, forensics builds a case against the attackers and helps you learn about the true nature of the attack and how to prevent (or limit) such attacks in the future. Forensics can be a time-consuming process where the data scientists sift through all of the data to piece together a puzzle. Once the puzzle has been put together, new tooling, processes, and monitoring can be put in place.

Data services and operations

One of the foundational components of any data organization is data services and operations. This team is responsible for the databases, data stores, data structures (e.g., data schemas), and the data warehouse. They are also responsible for the monitoring and upkeep of these systems. The other functional areas cannot exist without a top-notch data services and operations group; you could even say that the other areas live on top of this area. In some organizations, these teams exist independently of traditional operations teams. In my opinion, as these systems increase in sophistication, they need even greater coordination with operations groups. The systems and services this functional area provides need to be deployed in traditional data centers or in the cloud, and they need to be monitored for stability; staff also must be on hand to respond when systems go down. Established operations groups have expertise in these areas, and it makes sense to take advantage of such skills.

As an organization builds out its reporting requirements, the data services and operations team should become responsible for the reporting layer. While team members may not focus on defining metrics, they are critical in ensuring that the reports are delivered in a timely manner. Therefore, collaboration between data services and decision sciences is absolutely essential. For example, while a metric may be easy to define on paper, implementing it as part of a regular report may be unrealistic: the database queries required to implement the metric may be too complex to run as frequently as needed.

Data engineering and infrastructure

It's hard to overstate the sophistication of the tools needed to instrument, track, move, and process data at scale. The development and implementation of these technologies is the responsibility of the data engineering and infrastructure team. The technologies have evolved tremendously over the past decade, with an incredible amount of collaboration taking place through open source projects. Here are just a few samples:

  • Kafka, Flume, and Scribe are tools for streaming data collection. While the models differ, the general idea is that these programs collect data from many sources; aggregate the data; and feed it to a database, a system like Hadoop, or other clients.
  • Hadoop is currently the most widely used framework for processing data. Hadoop is an open source implementation of the MapReduce programming model that Google popularized in 2004. It is inherently batch-oriented; several newer technologies are aimed at processing streaming data, such as S4 and Storm.
  • Azkaban and Oozie are job schedulers. They manage and coordinate complex data flows.
  • Pig and Hive are languages for querying large non-relational datastores. Hive is very similar to SQL. Pig is a data-oriented scripting language.
  • Voldemort, Cassandra, and HBase are data stores that have been designed for good performance on very large datasets.
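The MapReduce model that Hadoop implements can be illustrated without a cluster: a map step emits key-value pairs, and a reduce step aggregates all pairs sharing a key. A word-count sketch in plain Python (on a real cluster, the map tasks run in parallel across machines and the framework handles the shuffle between phases):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["big data big ideas", "data at scale"]
# Hadoop would distribute the map calls; here we simply chain their outputs.
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'at': 1, 'scale': 1}
```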

Equally important is the ability to build monitoring and deployment technologies for these systems.

In addition to building the infrastructure, data engineering and infrastructure takes ideas developed by the product and marketing analytics group and implements them so they can operate in production at scale. For example, a recommendation engine for videos may be prototyped using SQL, Pig, or Hive. If testing shows that the recommendation engine is of value, it will need to be deployed so that it supports SLAs specifying appropriate availability and latencies. Migrating the product from prototype into production may require re-implementing it so it can deliver performance at scale. If SQL and a relational database prove to be too slow, you may need to move to HBase, queried by Hive or Pig. Once the application has been deployed, it must be monitored to ensure that it continues meeting its requirements. It must also be monitored to ensure that it is producing relevant results. Doing so requires more sophisticated software development.

Organizational and reporting alignment

Should an organization be structured according to the functional areas I've discussed, or via some other mechanism? There is no easy answer. Key things to consider include the people involved, the size and scale of the organization, and the organizational dynamics of the company (e.g., whether the company is product, marketing, or engineering driven).

In the early stages, people must wear multiple hats. For example, in a startup, you can't afford separate groups for analytics, security, operations, and infrastructure: one or two people may have to do everything. But as an organization grows, people naturally become more specialized. In addition, it's a good idea to remove any single points of failure. Some organizations use a "center-of-excellence model," where there is a centralized data team. Others use a hub-and-spoke model, where there is one central team and members are embedded within sponsoring teams (for example, the sales team may sponsor people in analytics to support their business needs). Some organizations are fully decentralized, and each team hires to fill its own requirements.

As vague as that answer is, here are the three lessons I've learned:

  1. If the team is small, its members should sit close to each other. There are many nuances to working with data, and high-speed interaction between team members resolves painful, trivial issues.
  2. Train people to fish — it only increases your organization's ability to be data driven. As previously discussed, organizations like Facebook and Zynga have democratized data effectively. As a result, these companies have more people conducting more analysis and looking at key metrics. This kind of access was nearly unheard of as little as five years ago. There is a downside: increased demands on the infrastructure and the need for training. The infrastructure challenge is largely a technical problem, and one of the easiest ways to manage training is to set up "office hours" and schedule data classes.
  3. All of the functional areas must stay in regular contact and communication. As the field of data science grows, technology and process innovations will also continue to grow. To keep up to date it is essential for all of these teams to share their experiences. Even if they are not part of the same reporting structure, there is a common bond of data that ties everyone together.

What makes a data scientist?

When Jeff Hammerbacher and I talked about our data science teams, we realized that as our organizations grew, we both had to figure out what to call the people on our teams. "Business analyst" seemed too limiting. "Data analyst" was a contender, but we felt that title might limit what people could do. After all, many of the people on our teams had deep engineering expertise. "Research scientist" was a reasonable job title used by companies like Sun, HP, Xerox, Yahoo, and IBM. However, we felt that most research scientists worked on projects that were futuristic and abstract, and the work was done in labs that were isolated from the product development teams. It might take years for lab research to affect key products, if it ever did. Instead, the focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new.

(Note: Although the term "data science" has a long history — usually referring to business intelligence — "data scientist" appears to be new. Jeff and I have been asking if anyone else has used this term before we coined it, but we've yet to find anyone who has.)

But how do you find data scientists? Whenever someone asks that question, I refer them back to a more fundamental question: what makes a good data scientist? Here is what I look for:

  • Technical expertise: the best data scientists typically have deep expertise in some scientific discipline.
  • Curiosity: a desire to go beneath the surface and discover and distill a problem down into a very clear set of hypotheses that can be tested.
  • Storytelling: the ability to use data to tell a story and to be able to communicate it effectively.
  • Cleverness: the ability to look at a problem in different, creative ways.

People often assume that data scientists need a background in computer science. In my experience, that hasn't been the case: my best data scientists have come from very different backgrounds. The inventor of LinkedIn's People You May Know was an experimental physicist. A computational chemist on my decision sciences team had solved a 100-year-old problem on energy states of water. An oceanographer made major impacts on the way we identify fraud. Perhaps most surprising was the neurosurgeon who turned out to be a wizard at identifying rich underlying trends in the data.

All the top data scientists share an innate sense of curiosity. Their curiosity is broad, and extends well beyond their day-to-day activities. They are interested in understanding many different areas of the company, business, industry, and technology. As a result, they are often able to bring disparate areas together in a novel way. For example, I've seen data scientists look at sales processes and realize that by using data in new ways they can make the sales team far more efficient. I've seen data scientists apply novel DNA sequencing techniques to find patterns of fraud.

What unifies all these people? They all have strong technical backgrounds. Most have advanced degrees (although I've worked with several outstanding data scientists who haven't graduated from college). But the real unifying thread is that all have had to work with a tremendous amount of data before starting to work on the "real" problem. When I was a first-year graduate student, I was interested in weather forecasting. I had an idea about how to understand the complexity of weather, but needed lots of data. Most of the data was available online, but due to its size, the data was in special formats and spread out over many different systems. To make that data useful for my research, I created a system that took over every computer in the department from 1 AM to 8 AM. During that time, it acquired, cleaned, and processed that data. Once done, my final dataset could easily fit in a single computer's RAM. And that's the whole point. The heavy lifting was required before I could start my research. Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation isn't something that gets in the way of solving the problem: it is the problem.

These are some examples of training that hone the skills a data scientist needs to be successful:

  • Finding rich data sources.
  • Working with large volumes of data despite hardware, software, and bandwidth constraints.
  • Cleaning the data and making sure that data is consistent.
  • Melding multiple datasets together.
  • Visualizing that data.
  • Building rich tooling that enables others to work with data effectively.
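
As a sketch of what the cleanup and melding steps above look like in practice, here is a toy pipeline in Python; the sources, field names, and values are all invented for illustration:

```python
def clean(record):
    """Normalize one raw record: trim whitespace, lowercase keys for joining."""
    return {
        "user": record["user"].strip().lower(),
        "value": float(record["value"]),  # raw feeds often mix strings and numbers
    }

# Two "sources" with inconsistent formatting, standing in for real feeds.
source_a = [{"user": " Alice ", "value": "3.0"}, {"user": "bob", "value": "1.5"}]
source_b = [{"user": "ALICE", "value": "2.0"}]

# Clean each source, then meld them by user.
merged = {}
for record in map(clean, source_a + source_b):
    merged.setdefault(record["user"], []).append(record["value"])

# Summarize: per-user totals, ready for visualization or further analysis.
totals = {user: sum(values) for user, values in merged.items()}
print(totals)  # {'alice': 5.0, 'bob': 1.5}
```

At real scale the mechanics change (distributed storage, batch jobs), but the shape of the work, normalize, join, summarize, is the same.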

One of the challenges of identifying data scientists is that there aren't many of them (yet). There are a number of programs that are helping train people, but the demand outstrips the supply. And experiences like my own suggest that the best way to become a data scientist isn't to be trained as a data scientist, but to do serious, data-intensive work in some other discipline.

Hiring data scientists was such a challenge at every place I've worked that I've adopted two models for building and training new hires. First, hire people with diverse backgrounds who have histories of playing with data to create something novel. Second, take incredibly bright and creative people right out of college and put them through a very robust internship program.

Another way to find great data scientists is to run a competition, like Netflix did. The Netflix Prize was a contest Netflix organized to improve its ability to predict how much a customer would enjoy a movie. If you don't want to organize your own competition, you can look at people who have performed well in competitions run by others. Kaggle and Topcoder are great resources when looking for this kind of talent. Kaggle has found its own top talent by hiring the best performers from its own competitions.

Hiring and talent

Many people focus on hiring great data scientists, but they leave out the need for continued intellectual and career growth: what I call talent growth. In the three years that I led LinkedIn's analytics and data teams, we developed a philosophy around three principles for hiring and talent growth.

Would we be willing to do a startup with you?

This is the first question we ask ourselves as a team when we meet to evaluate a candidate. It sums up a number of key criteria:

  • Time: If we're willing to do a startup with you, we're agreeing that we'd be willing to be locked in a small room with you for long periods of time. The ability to enjoy another person's company is critical to being able to invest in each other's growth.
  • Trust: Can we trust you? Will we have to look over your shoulder to make sure you're doing an A+ job? That may go without saying, but the reverse is also important: will you trust me? If you don't trust me, we're both in trouble.
  • Communication: Can we communicate with each other quickly and efficiently? If we're going to spend a tremendous amount of time together and if we need to trust each other, we'll need to communicate. Over time, we should be able to anticipate each other's needs in a way that allows us to be highly efficient.

Can you "knock the socks off" of the company in 90 days?

Once the first criterion has been met, it's critical to establish mechanisms to ensure that the candidate will succeed. We do this by setting expectations for the quality of the candidate's work, and by setting expectations for the velocity of his or her progress.

First, the "knock the socks off" part: by setting the goal high, we're asking whether you have the mettle to be part of an elite team. More importantly, it is a way of establishing a handshake for ensuring success. That's where the 90 days comes in. A new hire won't come up with something mind-blowing if the team doesn't bring the new hire up to speed quickly. The team needs to orient new hires around existing systems and processes. Similarly, the new hire needs to make the effort to progress quickly. Does this person ask questions when they get stuck? There are no dumb questions, and toughing it out because you're too proud or insecure to ask is counterproductive. Can the new hire bring a new system up in a day, or does it take a week or more? It's important to understand that doing something mind-blowing in 90 days is a team goal, as much as an individual goal. It is essential to pair the new hire with a successful member of the team. Success is shared.

This criterion sets new hires up for long-term success. Once they've passed the first milestone, they've done something that others in the company can recognize, and they have the confidence that will lead to future achievements. I've seen everyone from interns all the way to seasoned executives meet this criterion. And many of my top people have had multiple successes in their first 90 days.

In four to six years, will you be doing something amazing?

What does it mean to do something amazing? You might be running the team or the company. You might be doing something in a completely different discipline. You may have started a new company that's changing the industry. It's difficult to talk concretely because we're talking about potential and long-term futures. But we all want success to breed success, and I believe we can recognize the people who will help us to become mutually successful.

I don't necessarily expect a new hire to do something amazing while he or she works for me. The four- to six-year horizon allows members of the team to build long-term road maps. Many organizations make the time commitment amorphous by talking about vague, never-ending career ladders. But professionals no longer commit themselves to a single company for the bulk of their careers. With each new generation of professionals, the number of organizations and even careers has increased. So rather than fight it, embrace the fact that people will leave, so long as they leave to do something amazing. What I'm interested in is the potential: if you have that potential, we all win and we all grow together, whether your biggest successes come with my team or somewhere else.

Finally, this criterion is mutual. A new hire won't do something amazing, now or in the future, if the organization he or she works for doesn't hold up its end of the bargain. The organization must provide a platform and opportunities for the individual to be successful. Throwing a new hire into the deep end and expecting success doesn't cut it. Similarly, the individual must make the company successful to elevate the platform that he or she will launch from.

Building the LinkedIn data science team

I'm proud of what we've accomplished in building the LinkedIn data team. However, when we started, it didn't look anything like the organization that is there today. We started with 1.5 engineers (who would later go on to invent Voldemort, Kafka, and the real-time recommendation engine systems), no data services team (there wasn't even a data warehouse), and five analysts (who would later become the core of LinkedIn's data science group) who supported everyone from the CFO to the product managers.

When we started to build the team, the first thing I did was go to many different technical organizations (the likes of Yahoo, eBay, Google, Facebook, Sun, etc.) to get their thoughts and opinions. What I found really surprised me. The companies all had fantastic sets of employees who could be considered "data scientists." However, they were uniformly discouraged. They did first-rate work that they considered critical, but that had very little impact on the organization. They'd finish some analysis or come up with some ideas, and the product managers would say "that's nice, but it's not on our roadmap." As a result, the data scientists developing these ideas were frustrated, and their organizations had trouble capitalizing on what they were capable of doing.

Our solution was to make the data group a full product team responsible for designing, implementing, and maintaining products. As a product team, data scientists could experiment, build, and add value directly to the company. This resulted not only in further development of LinkedIn products like PYMK and Who's Viewed My Profile, but also in features like Skills, which tracks various skills and assembles a picture of what's needed to succeed in any given area, and Career Explorer, which helps you explore different career trajectories.

It's important that our data team wasn't composed solely of mathematicians and other "data people." It's a fully integrated product group that includes people working in design, web development, engineering, product marketing, and operations. They all understand and work with data, and I consider them all data scientists. We intentionally kept the distinction between different roles in the group blurry. Often, an engineer can have the insight that makes it clear how the product's design should work, or vice-versa — a designer can have the insight that helps the engineers understand how to better use the data. Or it may take someone from marketing to understand what a customer really wants to accomplish.

The silos that have traditionally separated data people from engineering, design, and marketing don't work when you're building data products. I would contend that it is questionable whether those silos work for any kind of product development. But with data, it never works to have a waterfall process in which one group defines the product, another builds visual mock-ups, a data scientist preps the data, and finally a set of engineers builds it to some specification document. We're not building Microsoft Office, or some other product where there's 20-plus years of shared wisdom about how interfaces should work. Every data project is a new experiment, and design is a critical part of that experiment. It's similar for operations: data products present entirely different stresses on a network and storage infrastructure than traditional sites. They capture much more data: petabytes and even exabytes. They deliver results that mash up data from many sources, some internal, some not. You're unlikely to create a data product that is reliable and that performs reasonably well if the product team doesn't incorporate operations from the start. This isn't a simple matter of pushing the prototype from your laptop to a server farm.

Finally, quality assurance (QA) of data products requires a radically different approach. Building test datasets is nontrivial, and it is often impossible to test all of the use cases. As different data streams come together into a final product, all sorts of relevance and precision issues become apparent. To develop this kind of product effectively, the ability to adapt and iterate quickly throughout the product life cycle is essential. To ensure agility, we build small groups to work on specific products, projects, or analyses. When we can, I like to seat people who depend on each other in the same area.

A data science team isn't just people: it's tooling, processes, the interaction between the team and the rest of the company, and more. At LinkedIn, we couldn't have succeeded if it weren't for the tools we used. When you're working with petabytes of data, you need serious power tools to do the heavy lifting. Some, such as Kafka and Voldemort (now open source projects), were homegrown, not because we thought we should have our own technology, but because we didn't have a choice. Our products couldn't scale without them. In addition to these technologies, we use other open source technologies such as Hadoop and many vendor-supported solutions as well. Many of these are for data warehousing and traditional business intelligence.

Tools are important because they allow you to automate. Automation frees up time, and makes it possible to do the creative work that leads to great products. Something as simple as reducing the turnaround time on a complex query from "get the result in the morning" to "get the result after a cup of coffee" represents a huge increase in productivity. If queries run overnight, you can only afford to ask questions when you already think you know the answer. If queries run in minutes, you can experiment and be creative.

Interaction between the data science teams and the rest of corporate culture is another key factor. It's easy for a data team (any team, really) to be bombarded by questions and requests. But not all requests are equally important. How do you make sure there's time to think about the big questions and the big problems? How do you balance incoming requests (most of which are tagged "as soon as possible") with long-term goals and projects? It's important to have a culture of prioritization: everyone in the group needs to be able to ask about the priority of incoming requests. Everything can't be urgent.

The result of building a data team is, paradoxically, that you see data products being built in all parts of the company. When the company sees what can be created with data, when it sees the power of being data enabled, you'll see data products appearing everywhere. That's how you know when you've won.


Companies are always looking to reinvent themselves. There's never been a better time: from economic pressures that demand greater efficiency, to new kinds of products that weren't conceivable a few years ago, the opportunities presented by data are tremendous.

But it's a mistake to treat data science teams like any old product group. (It is probably a mistake to treat any old product group like any old product group, but that's another issue.) To build teams that create great data products, you have to find people with the skills and the curiosity to ask the big questions. You have to build cross-disciplinary groups with people who are comfortable creating together, who trust each other, and who are willing to help each other be amazing. It's not easy, but if it were easy, it wouldn't be as much fun.


June 29 2011

Citizen science, civic media and radiation data hint at what's to come

Natural disasters and wars bring people together in unanticipated ways, as they use the tools and technologies easily at hand to help. From crisis response to situational awareness, free or low-cost online tools are empowering citizens to do more than donate money or blood: now they can donate time and expertise or, increasingly, act as sensors. In the United States, we saw a leading edge of this phenomenon in the Gulf of Mexico, where open source oil spill reporting provided a prototype for data collection via smartphone. In Japan, an analogous effort has grown and matured in the wake of the nuclear disaster that resulted from a massive earthquake and subsequent tsunami this spring.

The story of the RDTN project, which has grown into Safecast, a crowdsourced radiation detection network, isn't new, exactly, but it's important.

Radiation monitoring and grassroots mapping in Japan have been going on since April, as Emily Gertz has reported. I recently heard more about the Safecast project from Joi Ito at this year's Civic Media conference at the MIT Media Lab, where Ito described his involvement. Ethan Zuckerman blogged Ito's presentation, capturing his thoughts on how the Internet helped cover the Japanese earthquake (Twitter "beat the pants" off the mainstream media on the first day) and the Safecast project's evolution from a Skype chat.

According to Gertz' reporting, Safecast now includes data from a variety of sources, including feeds from the U.S. Environmental Protection Agency, Greenpeace, a volunteer crowdsourcing network in Russia, and the Japanese Ministry of Education, Culture, Sports, Science and Technology. Radiation data that's put into Safecast is made available for others to use via Pachube, an open-source platform for monitoring sensor data.

Ito said that a lot of radiation data that the Japanese government had indicated would be opened up has not been released, prompting the insight that crises, natural or otherwise, are an excellent opportunity to examine how effective an open government data implementation has been. Initially, the RDTN project entered an environment where there was nearly no radiation data available to the public.

"They were releasing data, it was just not very specific," said Sean Bonner in a Skype interview. Bonner has served as the communications lead for Safecast since the project began. The Japanese government "would release data for some areas and not for others — or rather they didn't have it," he said. "I don't think they had data they weren't releasing. Our point is that the sensors to detect the data were not in place at all. So we decided to help with that."

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

A KickStarter campaign in April raised nearly $37,000 to purchase Geiger counters to gather radiation data. Normally, that might be sufficient to obtain dozens of devices, given costs that range from $100 to nearly $1,000 for a professional-grade unit. The challenge is that if Geiger counters weren't easy to get before the Japanese nuclear meltdown, they became nearly impossible to obtain afterwards.

The Safecast project has also hacked together iGeigie, an iPhone-connected Geiger counter that can detect beta and gamma radiation. "The iGeigie is just a concept product, it's not a focus or a main solution," cautioned Bonner. "So a lot of what we've been doing is trying to help cover more ground with single sensors."

Even if they were in broader circulation, Geiger counters are unlikely to detect radiation in food or water. That's where open source hardware and hackerspaces become more relevant, specifically the Arduino boards that Radar and Make readers know well.

"We have Arduinos in the static devices that we are building and connecting to the web," said Bonner. "We're putting those around and they report data back to us." In other words, the Internet of Things is growing.

The sensors Safecast is deploying will capture alpha, beta and gamma radiation. "It's very important to track all three," said Bonner. "The very sensitive devices we are using are commercially produced. [They are] Inspector Alerts, made by International Medcom. Those use the industry standard 2-inch pancake sensor, which we are using in our other devices as well. We are using the same sensors everywhere."

Citizen science and open data

Open source software and citizens acting as sensors have steadily been integrated into journalism over the past few years, most dramatically in the videos and pictures uploaded after the 2009 Iran election and during this year's Arab Spring. Citizen science looks like the new frontier. "I think the real value of citizen media will be collecting data," said Rich Jones, founder of OpenWatch, a counter-surveillance project that aims to "police the police." Apps like OpenWatch can make "analyzing data a revolutionary act," said Justin Jacoby Smith. The development of Oil Reporter, grassroots mapping, Safecast, social networks, powerful connected smartphones and massive online processing power have put us into new territory. In the context of environmental or man-made disasters, collecting or sharing data can also be a civic act.

Crowdsourcing radiation data on Japan does raise legitimate questions about data quality and reporting, as Safecast's own project leads acknowledge.

"We make it very clear on the site that yes, there could most definitely be inaccuracies in crowd-sourced data," Safecast's Marcelino Alvarez told Public Radio International. "And yes, there could be contamination of a particular Geiger counter so the readings could be off," Alvarez said. "But our hope is that with more centers and more data being reported that those points that are outliers can be eliminated, and that trends can be discerned from the data."

The thinking here is that while some data may be inaccurate or some sensors misconfigured, over time the aggregate will skew toward accuracy. "More data is always better than less data," said Bonner. "Data from several sources is more reliable than from one source, by default. Without commenting on the reliability of any specific source, all the other sources help improve the overall data. Open data helps with that."
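
The intuition behind that claim, that aggregating many readings washes out individual sensor errors, can be sketched with a robust statistic such as the median; the readings below are invented:

```python
from statistics import mean, median

# Invented readings (counts per minute) from several crowd-sourced Geiger
# counters at one location; one miscalibrated unit reports a wildly high value.
readings = [42, 40, 44, 41, 390, 43]

# The mean is dragged upward by the bad sensor...
print(round(mean(readings), 1))  # 100.0
# ...but the median stays close to the consensus value.
print(median(readings))          # 42.5
```

With more sensors contributing, outlier-resistant aggregates like this make it easier to discern real trends despite individual faulty devices.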

Safecast is combining open data collected by citizen science with academic, NGO and open government data, where available, and then making it widely available. It's similar to other projects, where public data and experimental data are percolating.

Citizen science can create information orders of magnitude better than Google Maps, said Brian Boyer, news application developer at the Chicago Tribune, referencing the grassroots mapping work of Jeffrey Warren and others. "It's also fun," Boyer said. "You can get lots of people involved who wouldn't otherwise be involved doing a mapping project."

As news of these experiments spreads, the code and policies used to build them will also move with them. The spread of open source software is now being accompanied by open source hardware and maker culture. That will likely have unexpected effects.

When you can't meet demand for a device like a Geiger counter, people will start building their own, said Ito at the MIT Civic Media conference. He's seeing open hardware design spread globally. While there's an embargo on the export of many technologies, "we argue — and win — that open source software is free speech," said Ito. "Open source hardware is the same." If open source software now plays a fundamental role in new media, as evidenced by the 2011 winners of the Knight News Challenge, open source hardware may be supporting democracy in journalism too, says Ito.

Given Ito's success in anticipating (and funding) other technological changes, that's one prediction to watch.


May 09 2011

Why the term "data science" is flawed but useful

Mention "data science" to a lot of the high-profile people you might think practice it and you're likely to see rolling eyes and shaking heads. It has taken me a while, but I've learned to love the term, despite my doubts. The key reason is that the rest of the world understands roughly what I mean when I use it. After years of stumbling through long-winded explanations about what I do, I can now say "I'm a data scientist" and move on. It is still an incredibly hazy definition, but my former descriptions left people confused as well, so this approach is no worse and at least saves time.

With that in mind, here are the arguments I've heard against the term, and why I don't think they should stop its adoption.

It's not a real science

I just finished reading "The Philosophical Breakfast Club," the story of four Victorian friends who created the modern structure of science, as well as inventing the word "scientist." I grew up with the idea that physics, chemistry and biology were the only real sciences and every other subject using the term was just stealing their clothes ("Anything that needs science in the name is not a real science"). The book shows that from the beginning the label was never restricted to just the hard experimental sciences. It was chosen to promote a disciplined approach to reasoning that relied on data rather than the poorly-supported logical deductions many contemporaries favored. Data science fits comfortably in this more open tradition.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD

It's an unnecessary label

To me, it's obvious that there has been a massive change in the landscape over the last few years. Data and the tools to process it are suddenly abundant and cheap. Thousands of people are exploiting this change, making things that would have been impossible or impractical before now, using a whole new set of techniques. We need a term to describe this movement, so we can create job ads, conferences, training and books that reach the right people. Those goals might sound very mundane, but without an agreed-upon term we just can't communicate.

The name doesn't even make sense

As a friend said, "show me a science that doesn't involve data." I hate the name myself, but I also know it could be a lot worse. Just look at other fields that suffer under terms like "new archaeology" (now more than 50 years old) or "modernist art" (pushing a century). I learned from teenage bands that the naming process is the most divisive part of any new venture, so my philosophy has always been to take the name you're given, and rely on time and hard work to give it the right associations. Apple and Microsoft (née Micro-soft) are terrible startup names by any objective measure, but they've earned their mindshare. People are calling what we're doing "data science," so let's accept that and focus on moving the subject forward.

There's no definition

This is probably the deepest objection, and the one with the most teeth. There is no widely accepted boundary for what's inside and outside of data science's scope. Is it just a faddish rebranding of statistics? I don't think so, but I also don't have a full definition. I believe that the recent abundance of data has sparked something new in the world, and when I look around I see people with shared characteristics who don't fit into traditional categories. These people tend to work beyond the narrow specialties that dominate the corporate and institutional world, handling everything from finding the data and processing it at scale to visualizing it and writing it up as a story. They also seem to start by looking at what the data can tell them, and then picking interesting threads to follow, rather than the traditional scientist's approach of choosing the problem first and then finding data to shed light on it. I don't know what the eventual consensus will be on the limits of data science, but we're starting to see some outlines emerge.

Time for the community to rally

I'm betting a lot on the persistence of the term. If I'm wrong, the Data Science Toolkit will end up sounding as dated as "surfing the information super-highway." I think data science, as a phrase, is here to stay though, whether we like it or not. That means we as a community can either step up and steer its future, or let others exploit its current name recognition and dilute it beyond usefulness. If we don't rally around a workable definition to replace the current vagueness, we'll have lost a powerful tool for explaining our work.


January 25 2011

3 skills a data scientist needs

To prepare for next week's Strata Conference, we're continuing our series of conversations with big data innovators. Today, we talk with LinkedIn senior research scientist Pete Skomoroch about the core skills of data scientists.

The first skill, as you might expect, is a base in statistics, algorithms, machine learning, and mathematics. "You need to have a solid grounding in those principles to actually extract signals from this data and build things with it," Skomoroch said.

Second, a good data scientist is handy with a collection of open-source tools — Hadoop, Java, Python, among others. Knowing when to use those tools, and how to code, are prerequisites.
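
To illustrate the style of computation behind tools like Hadoop, here is a word count written as an explicit map step and reduce step in plain Python; the input lines are invented:

```python
from collections import Counter
from itertools import chain

lines = ["big data big tools", "data beats opinion"]

# Map: each line becomes a stream of (word, 1) pairs.
mapped = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# Reduce: sum the counts for each word. Hadoop performs this same
# grouping-and-summing, but distributed across many machines.
counts = Counter()
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'tools': 1, 'beats': 1, 'opinion': 1}
```

Knowing when this pattern is overkill (a laptop suffices) and when it is essential (the data won't fit on one machine) is exactly the kind of judgment Skomoroch describes.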

The third set of skills focuses on making products real and making data available to users. "That might mean data visualization, building web prototypes, using external APIs, and integrating with other services," Skomoroch said. In other words, this one combines coding skill, an eye for where data can add value, and collaboration with teams to make those products a reality.

Skomoroch's position gives him insight into the job market: what jobs are being posted, and who is hiring for which roles. He said he's glad to see new startups adding a data scientist or engineer to the founding staff. "That's a good sign."

Skomoroch discusses data science skill sets and related topics in the following video:

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% off registration with the code STR11RAD


November 30 2010

The data analysis path is built on curiosity, followed by action

A traditional view of data analysis involves precision, preparation, and methodical examination of defined datasets. Philipp Janert, author of "Data Analysis with Open Source Tools," has a somewhat different perspective. Those traditional elements are still important, but Janert also thinks simplicity, experimentation, action, and natural curiosity all shape effective data work. He expands on these ideas in the following interview.

Is data analysis inherently complicated?

Philipp Janert: I observe a tendency to do something complicated and fancy; to bring in a statistical concept and other "sophisticated" stuff. The problem is that the sophisticated stuff isn't that easy to understand.

Why not just look at the data set? Just look at it in an editor. Maybe you'll see something. Or, draw some graphs. Graphs don't require any sort of formal analytical training. These simple methods can be illuminating precisely because you don't need anything complicated, and nothing is hidden.

Why do analysts shy from simplicity?

PJ: I often perceive a great sense of insecurity in my co-workers when it comes to math. Because of that, I get the sense people are trying to almost hide behind complicated methods.

The classic case for me is that usually within the first three minutes of a conversation, people start talking about standard deviations. It's the one concept from classical statistics that everyone has heard of. But contextually, it's not clear what "standard deviation" really means. Are they talking about what's being measured by the standard deviation, namely the width of the distribution? Are they referring to one particular measure and how it's being calculated? Do they mean the conclusions that can be drawn from standard deviations in the Normal case?

We need to keep it simple and not get sucked into abstract concepts that may or may not be fully understood.
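Janert's point that the standard deviation measures the width of a distribution is easy to see directly in Python; the numbers below are invented for illustration:

```python
import statistics

# Two datasets with the same mean but very different spread:
# the standard deviation measures the width of the distribution.
narrow = [9.8, 10.1, 10.0, 9.9, 10.2]
wide = [2.0, 18.0, 5.0, 15.0, 10.0]

print(statistics.mean(narrow), statistics.mean(wide))  # both 10.0
print(statistics.pstdev(narrow))  # small: points cluster near the mean
print(statistics.pstdev(wide))    # large: points are spread out
```

Two samples can share a mean while telling entirely different stories; the standard deviation is just the first, crudest way to see that, which is why quoting it without context says so little.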

What tool or method offers the best starting point for data analysis?

PJ: Start by plotting the data set. Plot all of the data points and look at them. Don't try to calculate indicator quantities or summary statistics. Just look at what you see in the plot. Almost anything worthwhile can be seen in a good graph.
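In that spirit, even a plotting library is optional. A rough text plot of a small invented dataset is enough to make an outlier jump out:

```python
# A quick-and-dirty text plot: no plotting library needed for a
# first look at a dataset (hypothetical response times, in ms).
data = [12, 14, 13, 15, 14, 13, 48, 14, 12, 15]

lo, hi = min(data), max(data)
# scale each value to a bar of at most 40 characters
bars = [int(40 * (v - lo) / (hi - lo)) for v in data]
for i, (value, width) in enumerate(zip(data, bars)):
    print(f"{i:3d} {value:4d} {'*' * width}")
```

The anomalous reading (48) dominates the picture immediately, before any summary statistic has been computed.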

Is there a defined career path for people who want to become data scientists?

PJ: The stunning development over the 12 months I was writing this book is that "big data" became the thing that's on everybody's mind. All of a sudden, people are really concerned about very large datasets. Of course, this seems to be mostly driven by the social networking phenomenon. But the question is: What do we do with that data?

I know that for my purposes, I never need big data. When I ask people what they do with big data, I've found that it's not what I would call "analysis" at all, because it does not involve the development of conceptual models. It does not involve the inductive/deductive cycle of scientific reasoning.

It falls into one of two camps. The first is reporting. For instance, if a company is being paid based on the number of pages they serve, then counting the number of served pages is important. The resulting log files tend to be huge, so that's technically big data. But it's a very straightforward counting and reporting game.

The other camp is what I consider "generalized search." These are scenarios like: If User A likes movies B, C, and D, what other specific movie might User A want? That's a form of searching because you're not actually trying to create a conceptual model of user behavior. You're comparing individual data points; you're trying to find the movie that has the greatest similarity to a very specific other set of predefined movies. For this kind of generalized, exhaustive search, you need a lot of data because you look for the individual data points. But that's not really analysis as I understand it, either.
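Janert's "generalized search" can be sketched as a brute-force similarity scan; all titles and likes below are invented:

```python
# "Generalized search": no conceptual model of user behavior, just a
# brute-force scan for the item most similar to a predefined set.
liked = {"B", "C", "D"}                 # movies User A likes
candidates = {
    "E": {"B", "C", "D", "E"},          # movies co-liked with E, etc.
    "F": {"B", "X"},
    "G": {"C", "D", "Y"},
}

def jaccard(a, b):
    # overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)
    return len(a & b) / len(a | b)

best = max(candidates, key=lambda m: jaccard(liked, candidates[m]))
print(best)  # "E" overlaps most with User A's likes
```

Nothing here explains *why* User A likes those movies; it only compares data points exhaustively, which is exactly Janert's distinction.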

So coming back to your original question -- is there a path to becoming a "data scientist?" -- we need to first find out what data science might be. It will encompass different things: the kind of big data I mentioned; reporting and business intelligence; hopefully the kind of conceptual modeling that I do. But depending on what you're trying to accomplish, you could require very different skills.

For what I do -- and this is really the only data analysis I can speak about with any sense of confidence -- the most important skill is curiosity. This sounds a little tacky, but I mean it. Are you curious why the grass is green? Are you curious why the sky is blue? I'm talking about questions of this sort. These are representative of the inquisitive mind of a scientist. If you have that, you're in good shape and you can start anywhere.


Besides curiosity, are there other traits or skills that benefit data analysts?

PJ: You need experience with empirical work. And by that I mean someone who checks the "idiot lights" on a router to make sure the cable is plugged in before troubleshooting anything else. We've all been in the situation where you reinstall the IP stack because you can't get network connectivity, only to realize later that the router wasn't plugged in. Failures like that matter because empirical skills can be learned from them.

It's also nice, but not essential, to have taken a college math class and retained a bit. You should learn a programming language as well because you need to know how to manipulate data on your own. Any of the current scripting languages will do.

The last thing is that you need to actually do the work. Find a dataset that you're interested in and work on it. It doesn't have to be fancy, but you have to get started. You can't just sit there and expect it to happen. Experience and practice are really important.

It sounds like the "just start" mindset you find in the Maker/DIY community also applies to data. Is that right?

PJ: I don't know about other people, but I do this because it's fun. And that's a similar mentality to the Make space. They're more about creating something as opposed to understanding something, but the mentality is very much the same.

It's about curiosity followed by action. You look at the dataset and then go deeper to discover something. And this process isn't defined by tools. Personally, I'm interested in what somebody's trying to find rather than if they're using all the right statistical methods.


June 02 2010

What is data science?

We've all heard it: according to Hal Varian, statistics is the next sexy job. Five years ago, in What is Web 2.0, Tim O'Reilly said that "data is the next Intel Inside." But what does that statement mean? Why do we suddenly care about statistics and about data?

In this post, I examine the many sides of data science -- the technologies, the companies and the unique skill sets.


What is data science?

The web is full of "data-driven apps." Almost any e-commerce application is a data-driven application. There's a database behind a web front end, and middleware that talks to a number of other databases and data services (credit card processing companies, banks, and so on). But merely using data isn't really what we mean by "data science." A data application acquires its value from the data itself, and creates more data as a result. It's not just an application with data; it's a data product. Data science enables the creation of data products.

One of the earlier data products on the Web was the CDDB database. The developers of CDDB realized that any CD had a unique signature, based on the exact length (in samples) of each track on the CD. Gracenote built a database of track lengths, and coupled it to a database of album metadata (track titles, artists, album titles). If you've ever used iTunes to rip a CD, you've taken advantage of this database. Before it does anything else, iTunes reads the length of every track, sends it to CDDB, and gets back the track titles. If you have a CD that's not in the database (including a CD you've made yourself), you can create an entry for an unknown album. While this sounds simple enough, it's revolutionary: CDDB views music as data, not as audio, and creates new value in doing so. Their business is fundamentally different from selling music, sharing music, or analyzing musical tastes (though these can also be "data products"). CDDB arises entirely from viewing a musical problem as a data problem.
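The core of the idea can be sketched in a few lines of Python. This is not CDDB's actual disc-ID algorithm, which is more involved; it only illustrates treating track lengths as a lookup key (the lengths below are invented):

```python
import hashlib

# Simplified analog of the CDDB idea: the exact track lengths of a CD
# form a near-unique fingerprint that can key a metadata lookup.
track_lengths = [19_440_000, 15_876_000, 22_050_000]  # lengths in samples

def disc_signature(lengths):
    key = ",".join(str(n) for n in lengths)
    return hashlib.sha1(key.encode()).hexdigest()[:12]

sig = disc_signature(track_lengths)
print(sig)  # same disc -> same signature; look it up to get titles
```

The music itself never enters the computation; only its lengths do. That is what "viewing a musical problem as a data problem" means in practice.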

Google is a master at creating data products. Here are a few examples:

  • Google's breakthrough was realizing that a search engine could use input other than the text on the page. Google's PageRank algorithm was among the first to use data outside of the page itself, in particular, the number of links pointing to a page. Tracking links made Google searches much more useful, and PageRank has been a key ingredient in the company's success.

  • Spell checking isn't a terribly difficult problem, but by suggesting corrections to misspelled searches, and observing what the user clicks in response, Google made it much more accurate. They've built a dictionary of common misspellings, their corrections, and the contexts in which they occur.

  • Speech recognition has always been a hard problem, and it remains difficult. But Google has made huge strides by using the voice data they've collected, and has been able to integrate voice search into their core search engine.

  • During the Swine Flu epidemic of 2009, Google was able to track the progress of the epidemic by following searches for flu-related topics.

Flu trends

Google was able to spot trends in the Swine Flu epidemic roughly two weeks before the Centers for Disease Control by analyzing searches that people were making in different regions of the country.
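The spell-checking example in the list above -- a dictionary of misspellings built from what users actually click -- can be sketched roughly like this; the log entries are invented:

```python
from collections import Counter, defaultdict

# Log which suggested correction users click, and let the counts
# build a dictionary of common misspellings and their fixes.
click_log = [
    ("recieve", "receive"), ("recieve", "receive"),
    ("recieve", "recipe"),  ("teh", "the"),
]

corrections = defaultdict(Counter)
for typed, clicked in click_log:
    corrections[typed][clicked] += 1

def correct(word):
    # return the most frequently chosen correction, if we have one
    if word in corrections:
        return corrections[word].most_common(1)[0][0]
    return word

print(correct("recieve"))  # "receive" (chosen 2 of 3 times)
```

The accuracy comes from the feedback loop, not the algorithm: every click is a user voting on the right answer.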

Google isn't the only company that knows how to use data. Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know, or should know, with sometimes frightening accuracy. Amazon saves your searches, correlates what you search for with what other users search for, and uses it to create surprisingly appropriate recommendations. These recommendations are "data products" that help to drive Amazon's more traditional retail business. They come about because Amazon understands that a book isn't just a book, a camera isn't just a camera, and a customer isn't just a customer; customers generate a trail of "data exhaust" that can be mined and put to use, and a camera is a cloud of data that can be correlated with the customers' behavior, the data they leave every time they visit the site.

The thread that ties most of these applications together is that data collected from users provides added value. Whether that data is search terms, voice samples, or product reviews, the users are in a feedback loop in which they contribute to the products they use. That's the beginning of data science.

In the last few years, there has been an explosion in the amount of data that's available. Whether we're talking about web server logs, tweet streams, online transaction records, "citizen science," data from sensors, government data, or some other source, the problem isn't finding data, it's figuring out what to do with it. And it's not just companies using their own data, or the data contributed by their users. It's increasingly common to mashup data from a number of sources. "Data Mashups in R" analyzes mortgage foreclosures in Philadelphia County by taking a public report from the county sheriff's office, extracting addresses and using Yahoo to convert the addresses to latitude and longitude, then using the geographical data to place the foreclosures on a map (another data source), and group them by neighborhood, valuation, neighborhood per-capita income, and other socio-economic factors.

The question facing every company today, every startup, every non-profit, every project site that wants to attract a community, is how to use data effectively -- not just their own data, but all the data that's available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.

To get a sense for what skills are required, let's look at the data lifecycle: where it comes from, how you use it, and where it goes.

Where data comes from

Data is everywhere: your government, your web server, your business partners, even your body. While we aren't drowning in a sea of data, we're finding that almost everything can (or has) been instrumented. At O'Reilly, we frequently combine publishing industry data from Nielsen BookScan with our own sales data, publicly available Amazon data, and even job data to see what's happening in the publishing industry. Sites like Infochimps and Factual provide access to many large datasets, including climate data, MySpace activity streams, and game logs from sporting events. Factual enlists users to update and improve its datasets, which cover topics ranging from endocrinologists to hiking trails.

1956 disk drive

One of the first commercial disk drives from IBM. It had a 5 MB capacity and was stored in a cabinet roughly the size of a luxury refrigerator. In contrast, a 32 GB microSD card measures around 5/8 x 3/8 inch and weighs about 0.5 gram.

Photo: Mike Loukides. Disk drive on display at IBM Almaden Research

Much of the data we currently work with is the direct consequence of Web 2.0, and of Moore's Law applied to data. The web has people spending more time online, and leaving a trail of data wherever they go. Mobile applications leave an even richer data trail, since many of them are annotated with geolocation, or involve video or audio, all of which can be mined. Point-of-sale devices and frequent-shopper's cards make it possible to capture all of your retail transactions, not just the ones you make online. All of this data would be useless if we couldn't store it, and that's where Moore's Law comes in. Since the early '80s, processor speed has increased from 10 MHz to 3.6 GHz -- an increase of 360 (not counting increases in word length and number of cores). But we've seen much bigger increases in storage capacity, on every level. RAM has moved from $1,000/MB to roughly $25/GB -- a price reduction of about 40,000, to say nothing of the reduction in size and increase in speed. Hitachi made the first gigabyte disk drives in 1982, weighing in at roughly 250 pounds; now terabyte drives are consumer equipment, and a 32 GB microSD card weighs about half a gram. Whether you look at bits per gram, bits per dollar, or raw capacity, storage has more than kept pace with the increase of CPU speed.

The importance of Moore's law as applied to data isn't just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put into it. The data exhaust you leave behind whenever you surf the web, friend someone on Facebook, or make a purchase in your local supermarket, is all carefully collected and analyzed. Increased storage capacity demands increased sophistication in the analysis and use of that data. That's the foundation of data science.

So, how do we make that data useful? The first step of any data analysis project is "data conditioning," or getting data into a state where it's usable. We are seeing more data in formats that are easier to consume: Atom data feeds, web services, microformats, and other newer technologies provide data in formats that are directly machine-consumable. But old-style screen scraping hasn't died, and isn't going to die. Many sources of "wild data" are extremely messy. They aren't well-behaved XML files with all the metadata nicely in place. The foreclosure data used in "Data Mashups in R" was posted on a public website by the Philadelphia county sheriff's office. This data was presented as an HTML file that was probably generated automatically from a spreadsheet. If you've ever seen the HTML that's generated by Excel, you know that's going to be fun to process.

Data conditioning can involve cleaning up messy HTML with tools like Beautiful Soup, natural language processing to parse plain text in English and other languages, or even getting humans to do the dirty work. You're likely to be dealing with an array of data sources, all in different forms. It would be nice if there was a standard set of tools to do the job, but there isn't. To do data conditioning, you have to be ready for whatever comes, and be willing to use anything from ancient Unix utilities such as awk to XML parsers and machine learning libraries. Scripting languages, such as Perl and Python, are essential.
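As a rough sketch of what data conditioning can look like, here is messy, Excel-style HTML scraped with nothing but the Python standard library (the table contents are invented; Beautiful Soup is more forgiving of truly broken markup):

```python
from html.parser import HTMLParser

messy_html = """
<table><tr><td>123 Main St</td><td>$85,000</td></tr>
<tr><td>9 Oak Ave</td><td>$120,500</td></table>
"""  # note the unclosed final <tr> -- typical "wild data"

class TableScraper(HTMLParser):
    # accumulate <td> cell text into rows, one row per <tr>
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_td = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            if self.row:
                self.rows.append(self.row)
            self.row = []
        elif tag == "td":
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "table" and self.row:
            self.rows.append(self.row)  # flush the unterminated row
    def handle_data(self, data):
        if self.in_td and data.strip():
            self.row.append(data.strip())

scraper = TableScraper()
scraper.feed(messy_html)
print(scraper.rows)  # [['123 Main St', '$85,000'], ['9 Oak Ave', '$120,500']]
```

Half the code exists to tolerate the missing closing tag, which is the point: conditioning wild data is mostly defensive work.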

Once you've parsed the data, you can start thinking about the quality of your data. Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn't always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting? It's reported that the discovery of global warming was delayed because automated data collection tools discarded readings that were too low 1. In data science, what you have is frequently all you're going to get. It's usually impossible to get "better" data, and you have no alternative but to work with the data at hand.

If the problem involves human language, understanding the data adds another dimension to the problem. Roger Magoulas, who runs the data analysis group at O'Reilly, was recently searching a database for Apple job listings requiring geolocation skills. While that sounds like a simple task, the trick was disambiguating "Apple" from many job postings in the growing Apple industry. To do it well you need to understand the grammatical structure of a job posting; you need to be able to parse the English. And that problem is showing up more and more frequently. Try using Google Trends to figure out what's happening with the Cassandra database or the Python language, and you'll get a sense of the problem. Google has indexed many, many websites about large snakes. Disambiguation is never an easy task, but tools like the Natural Language Toolkit library can make it simpler.
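A deliberately naive sketch of that disambiguation problem: guess whether "Apple" in a posting means the company from the surrounding words. Real solutions parse the language properly (NLTK helps); the cue words and postings here are invented:

```python
# Naive context-based disambiguation: "Apple" probably means the
# company if company-flavored words appear nearby.
company_cues = {"iphone", "ios", "cupertino", "macos", "engineer"}

def mentions_apple_the_company(posting):
    words = {w.strip(".,").lower() for w in posting.split()}
    return "apple" in words and bool(words & company_cues)

print(mentions_apple_the_company("Apple seeks an iOS engineer in Cupertino."))  # True
print(mentions_apple_the_company("Sorting apple varieties at the orchard."))    # False
```

A cue-word list breaks the moment the vocabulary shifts, which is why the problem ultimately demands understanding the grammatical structure of the text, not just its words.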

When natural language processing fails, you can replace artificial intelligence with human intelligence. That's where services like Amazon's Mechanical Turk come in. If you can split your task up into a large number of subtasks that are easily described, you can use Mechanical Turk's marketplace for cheap labor. For example, if you're looking at job listings, and want to know which originated with Apple, you can have real people do the classification for roughly $0.01 each. If you have already reduced the set to 10,000 postings with the word "Apple," paying humans $0.01 to classify them only costs $100.

Working with data at scale

We've all heard a lot about "big data," but "big" is really a red herring. Oil companies, telecommunications companies, and other data-centric industries have had huge datasets for a long time. And as storage capacity continues to expand, today's "big" is certainly tomorrow's "medium" and next week's "small." The most meaningful definition I've heard: "big data" is when the size of the data itself becomes part of the problem. We're discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.

What are we trying to do with data that's different? According to Jeff Hammerbacher 2 (@hackingdata), we're trying to build information platforms or dataspaces. Information platforms are similar to traditional data warehouses, but different. They expose rich APIs, and are designed for exploring and understanding the data rather than for traditional analysis and reporting. They accept all data formats, including the most messy, and their schemas evolve as the understanding of the data changes.

Most of the organizations that have built data platforms have found it necessary to go beyond the relational database model. Traditional relational database systems stop being effective at this scale. Managing sharding and replication across a horde of database servers is difficult and slow. The need to define a schema in advance conflicts with the reality of multiple, unstructured data sources, in which you may not know what's important until after you've analyzed the data. Relational databases are designed for consistency, to support complex transactions that can easily be rolled back if any one of a complex set of operations fails. While rock-solid consistency is crucial to many applications, it's not really necessary for the kind of analysis we're discussing here. Do you really care if you have 1,010 or 1,012 Twitter followers? Precision has an allure, but in most data-driven applications outside of finance, that allure is deceptive. Most data analysis is comparative: if you're asking whether sales to Northern Europe are increasing faster than sales to Southern Europe, you aren't concerned about the difference between 5.92 percent annual growth and 5.93 percent.

To store huge datasets effectively, we've seen a new breed of databases appear. These are frequently called NoSQL databases, or Non-Relational databases, though neither term is very useful. They group together fundamentally dissimilar products by telling you what they aren't. Many of these databases are the logical descendants of Google's BigTable and Amazon's Dynamo, and are designed to be distributed across many nodes, to provide "eventual consistency" but not absolute consistency, and to have very flexible schema. While there are two dozen or so products available (almost all of them open source), a few leaders have established themselves:

  • Cassandra: Developed at Facebook, in production use at Twitter, Rackspace, Reddit, and other large sites. Cassandra is designed for high performance, reliability, and automatic replication. It has a very flexible data model. A new startup, Riptano, provides commercial support.

  • HBase: Part of the Apache Hadoop project, and modelled on Google's BigTable. Suitable for extremely large databases (billions of rows, millions of columns), distributed across thousands of nodes. Along with Hadoop, commercial support is provided by Cloudera.

Storing data is only part of building a data platform, though. Data is only useful if you can do something with it, and enormous datasets present computational problems. Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the "map" stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors; the intermediate results are then combined by a single reduce task. In hindsight, MapReduce seems like an obvious solution to Google's biggest problem, creating large searches. It's easy to distribute a search across thousands of processors, and then combine the results into a single set of answers. What's less obvious is that MapReduce has proven to be widely applicable to many large data problems, ranging from search to machine learning.
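The pattern is easy to see in miniature. This single-process Python sketch mimics the map, shuffle, and reduce phases for a word count, the canonical MapReduce example:

```python
from collections import defaultdict
from itertools import chain

# A single-process sketch of MapReduce: "map" emits (key, value)
# pairs from each input chunk, a shuffle groups them by key, and
# "reduce" combines each group.
documents = ["big data big ideas", "data beats opinion"]

def map_phase(doc):                # runs independently per chunk
    return [(word, 1) for word in doc.split()]

def reduce_phase(key, values):     # combines one key's values
    return key, sum(values)

# shuffle: group intermediate pairs by key
groups = defaultdict(list)
for key, value in chain.from_iterable(map(map_phase, documents)):
    groups[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in groups.items())
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'beats': 1, 'opinion': 1}
```

Because each map call touches only its own chunk and each reduce call only its own key, both phases parallelize across as many machines as you have.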

The most popular open source implementation of MapReduce is the Hadoop project. Yahoo's claim that they had built the world's largest production Hadoop application, with 10,000 cores running Linux, brought it onto center stage. Many of the key Hadoop developers have found a home at Cloudera, which provides commercial support. Amazon's Elastic MapReduce makes it much easier to put Hadoop to work without investing in racks of Linux machines, by providing preconfigured Hadoop images for its EC2 clusters. You can allocate and de-allocate processors as needed, paying only for the time you use them.

Hadoop goes far beyond a simple MapReduce implementation (of which there are several); it's the key component of a data platform. It incorporates HDFS, a distributed filesystem designed for the performance and reliability requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components. If anything can be called a one-stop information platform, Hadoop is it.

Hadoop has been instrumental in enabling "agile" data analysis. In software development, "agile practices" are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turn-around times. If you start a calculation, it might not finish for hours, or even days. But Hadoop (and particularly Elastic MapReduce) make it easy to build clusters that can perform computations on large datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It's easier to consult with clients to figure out whether you're asking the right questions, and it's possible to pursue intriguing possibilities that you'd otherwise have to drop for lack of time.

Hadoop is essentially a batch system, but Hadoop Online Prototype (HOP) is an experimental project that enables stream processing: it processes data as it arrives and delivers intermediate results in (near) real-time. Near real-time data analysis enables features like trending topics on sites like Twitter. These features only require soft real-time; reports on trending topics don't require millisecond accuracy. As with the number of followers on Twitter, a "trending topics" report only needs to be current to within five minutes -- or even an hour. According to data scientist Hilary Mason (@hmason), it's possible to precompute much of the calculation, then use one of the experiments in real-time MapReduce to get presentable results.
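That precompute-then-report idea can be sketched with per-minute count buckets; a trending report only needs the sum of the recent buckets, which is exactly the "soft real-time" a feature like this requires (topics invented):

```python
from collections import Counter, deque

# Precompute topic counts per minute, keep only recent buckets, and
# report trends from their sum -- current to within one bucket.
WINDOW = 5  # minutes of history to keep

buckets = deque(maxlen=WINDOW)     # old buckets fall off automatically
for minute_of_topics in (["flu", "flu", "oscars"],
                         ["flu", "election"],
                         ["oscars", "oscars", "oscars"]):
    buckets.append(Counter(minute_of_topics))

trending = sum(buckets, Counter())
print(trending.most_common(2))  # [('oscars', 4), ('flu', 3)]
```

Each bucket can be computed by a batch job as its minute closes; only the cheap final sum happens at query time.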

Machine learning is another essential tool for the data scientist. We now expect web and mobile applications to incorporate recommendation engines, and building a recommendation engine is a quintessential artificial intelligence problem. You don't have to look at many modern web applications to see classification, error detection, image matching (behind Google Goggles and SnapTell) and even face detection -- an ill-advised mobile application lets you take someone's picture with a cell phone, and look up that person's identity using photos available online. Andrew Ng's Machine Learning course is one of the most popular courses in computer science at Stanford, with hundreds of students (this video is highly recommended).
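A recommendation engine can be surprisingly small in sketch form: score other users by cosine similarity and suggest what the nearest neighbor liked. All ratings below are invented:

```python
import math

# Minimal user-based recommender: find the most similar other user
# and suggest what they rated that "you" haven't seen.
ratings = {
    "you":   {"Alien": 5, "Heat": 4},
    "alice": {"Alien": 5, "Heat": 5, "Blade Runner": 5},
    "bob":   {"Mamma Mia": 5, "Heat": 1},
}

def cosine(a, b):
    # cosine similarity over the movies two users have both rated
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

me = ratings["you"]
neighbor = max((u for u in ratings if u != "you"),
               key=lambda u: cosine(me, ratings[u]))
picks = [m for m in ratings[neighbor] if m not in me]
print(neighbor, picks)  # alice ['Blade Runner']
```

Production systems replace the brute-force neighbor scan with precomputed similarity matrices or matrix factorization, but the shape of the problem is the same.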

There are many libraries available for machine learning: PyBrain in Python, Elefant, Weka in Java, and Mahout (coupled to Hadoop). Google has just announced their Prediction API, which exposes their machine learning algorithms for public use via a RESTful interface. For computer vision, the OpenCV library is a de-facto standard.

Mechanical Turk is also an important part of the toolbox. Machine learning almost always requires a "training set," or a significant body of known data with which to develop and tune the application. The Turk is an excellent way to develop training sets. Once you've collected your training data (perhaps a large collection of public photos from Twitter), you can have humans classify them inexpensively -- possibly sorting them into categories, possibly drawing circles around faces, cars, or whatever interests you. It's an excellent way to classify a few thousand data points at a cost of a few cents each. Even a relatively large job only costs a few hundred dollars.

While I haven't stressed traditional statistics, building statistical models plays an important role in any data analysis. According to Mike Driscoll (@dataspora), statistics is the "grammar of data science." It is crucial to "making data speak coherently." We've all heard the joke that eating pickles causes death, because everyone who dies has eaten pickles. That joke doesn't work if you understand what correlation means. More to the point, it's easy to notice that one advertisement for R in a Nutshell generated 2 percent more conversions than another. But it takes statistics to know whether this difference is significant, or just a random fluctuation. Data science isn't just about the existence of data, or making guesses about what that data might mean; it's about testing hypotheses and making sure that the conclusions you're drawing from the data are valid. Statistics plays a role in everything from traditional business intelligence (BI) to understanding how Google's ad auctions work. Statistics has become a basic skill. It isn't superseded by newer techniques from machine learning and other disciplines; it complements them.
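The ad example is exactly the kind of question a two-proportion z-test answers. This sketch uses invented counts and the normal approximation:

```python
import math

# Two-proportion z-test: is a small difference in conversion rates
# statistically significant, or a random fluctuation?
def z_test(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the normal approximation
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# ad A: 510 conversions in 10,000 views; ad B: 500 in 10,000
z, p = z_test(510, 10_000, 500, 10_000)
print(round(z, 2), round(p, 2))  # small z, large p: not significant
```

With these counts, a 2 percent relative lift is well within random fluctuation; it would take a much larger gap, or far more traffic, before the difference meant anything.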

While there are many commercial statistical packages, the open source R language -- and its comprehensive package library, CRAN -- is an essential tool. Although R is an odd and quirky language, particularly to someone with a background in computer science, it comes close to providing "one stop shopping" for most statistical work. It has excellent graphics facilities; CRAN includes parsers for many kinds of data; and newer extensions extend R into distributed computing. If there's a single tool that provides an end-to-end solution for statistics work, R is it.

Making data tell its story

A picture may or may not be worth a thousand words, but a picture is certainly worth a thousand numbers. The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph. Edward Tufte's The Visual Display of Quantitative Information is the classic for data visualization, and a foundational text for anyone practicing data science. But that's not really what concerns us here. Visualization is crucial to each stage of the data scientist's work. According to Martin Wattenberg (@wattenberg, founder of Flowing Media), visualization is key to data conditioning: if you want to find out just how bad your data is, try plotting it. Visualization is also frequently the first step in analysis. Hilary Mason says that when she gets a new data set, she starts by making a dozen or more scatter plots, trying to get a sense of what might be interesting. Once you've gotten some hints at what the data might be saying, you can follow it up with more detailed analysis.

    There are many packages for plotting and presenting data. GnuPlot is very effective; R incorporates a fairly comprehensive graphics package; Ben Fry's Processing is the state of the art, particularly if you need to create animations that show how things change over time. At IBM's Many Eyes, many of the visualizations are full-fledged interactive applications.

    Nathan Yau's FlowingData blog is a great place to look for creative visualizations. One of my favorites is this animation of the growth of Walmart over time. And this is one place where "art" comes in: not just the aesthetics of the visualization itself, but how you understand it. Does it look like the spread of cancer throughout a body? Or the spread of a flu virus through a population? Making data tell its story isn't just a matter of presenting results; it involves making connections, then going back to other data sources to verify them. Does a successful retail chain spread like an epidemic, and if so, does that give us new insights into how economies work? That's not a question we could even have asked a few years ago. There was insufficient computing power, the data was all locked up in proprietary sources, and the tools for working with the data were insufficient. It's the kind of question we now ask routinely.

    Data scientists

    Data science requires skills ranging from traditional computer science to mathematics to art. Describing the data science group he put together at Facebook (possibly the first data science group at a consumer-oriented web property), Jeff Hammerbacher said:

    ... on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.3

    Where do you find the people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be "hard scientists," particularly physicists, rather than computer science majors. Physicists have a strong mathematical background and computing skills, and they come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you've just spent a lot of grant money generating data, you can't just throw the data out if it isn't as clean as you'd like. You have to make it tell its story. You need some creativity for when the story the data is telling isn't what you think it's telling.

    Scientists also know how to break large problems up into smaller problems. Patil described the process of creating the group recommendation feature at LinkedIn. It would have been easy to turn this into a high-ceremony development project that would take thousands of hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn's membership. But the process worked quite differently: it started out with a relatively small, simple program that looked at members' profiles and made recommendations accordingly. It asked things like: did you go to Cornell? Then you might like to join the Cornell Alumni group. It then branched out incrementally. In addition to looking at profiles, LinkedIn's data scientists started looking at events that members attended. Then at books members had in their libraries. The result was a valuable data product that analyzed a huge database -- but it was never conceived as such. It started small, and added value iteratively. It was an agile, flexible process that built toward its goal incrementally, rather than tackling a huge mountain of data all at once.
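That first simple iteration might have looked something like the sketch below. Everything here is hypothetical -- the profiles, the groups, and the matching rule are invented for illustration, not LinkedIn's actual code -- but it shows how small the starting point of a big data product can be:

```python
# First iteration of a group recommender: match profile fields to groups.
def recommend_groups(profile, groups):
    """Suggest any group whose keyword appears in the member's profile."""
    text = " ".join(str(v).lower() for v in profile.values())
    return [g["name"] for g in groups if g["keyword"] in text]

# Hypothetical member and groups.
member = {"name": "Ada", "school": "Cornell", "employer": "Acme Corp"}
groups = [{"name": "Cornell Alumni", "keyword": "cornell"},
          {"name": "Stanford Alumni", "keyword": "stanford"}]

print(recommend_groups(member, groups))
# Later iterations would add more signals: events attended, books owned, ...
```

The point is the shape of the process, not the code: each iteration adds one more data source to a system that already works, instead of designing the massive-correlation version up front.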

    This is the heart of what Patil calls "data jiujitsu" -- using smaller auxiliary problems to solve a large, difficult problem that appears intractable. CDDB is a great example of data jiujitsu: identifying music by analyzing an audio stream directly is a very difficult problem (though not unsolvable -- see midomi, for example). But the CDDB staff used data creatively to solve a much more tractable problem that gave them the same result. Computing a signature based on track lengths, and then looking up that signature in a database, is trivially simple.
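To make the CDDB trick concrete, here's a toy version. Real CDDB disc IDs use a specific checksum over track offsets; this sketch substitutes a plain hash and an in-memory dictionary, but the jiujitsu move is the same: a hard perception problem becomes a trivial lookup.

```python
import hashlib

def disc_signature(track_lengths_seconds):
    """Reduce a disc to a short signature derived only from track lengths."""
    key = ",".join(str(t) for t in track_lengths_seconds)
    return hashlib.sha1(key.encode()).hexdigest()[:12]

# Hypothetical database mapping signatures to album titles.
database = {disc_signature([343, 291, 412, 248]): "Example Album"}

def identify(track_lengths_seconds):
    """Identify a disc by signature lookup -- no audio analysis required."""
    return database.get(disc_signature(track_lengths_seconds), "unknown disc")

print(identify([343, 291, 412, 248]))
```

Identifying music from the audio stream itself requires sophisticated signal processing; identifying it from four integers requires a hash and a dictionary.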

    Hiring trends for data science

    It's not easy to get a handle on jobs in data science. However, data from O'Reilly Research shows a steady year-over-year increase in Hadoop and Cassandra job listings, which are good proxies for the "data science" market as a whole. This graph shows the increase in Cassandra jobs, and the companies listing Cassandra positions, over time.

    Entrepreneurship is another piece of the puzzle. Patil's first flippant answer to "what kind of person are you looking for when you hire a data scientist?" was "someone you would start a company with." That's an important insight: we're entering the era of products that are built on data. We don't yet know what those products are, but we do know that the winners will be the people, and the companies, that find those products. Hilary Mason came to the same conclusion. Her job as scientist at bit.ly is really to investigate the data that bit.ly is generating, and find out how to build interesting products from it. No one in the nascent data industry is trying to build the 2012 Nissan Stanza or Office 2015; they're all trying to find new products. In addition to being physicists, mathematicians, programmers, and artists, they're entrepreneurs.

    Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: "here's a lot of data, what can you make from it?"

    The future belongs to the companies who figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success. They were the vanguard, but newer companies are following their path. Whether it's mining your personal biology, building maps from the shared experience of millions of travellers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data. The part of Hal Varian's quote that nobody remembers says it all:

    The ability to take data -- to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it -- that's going to be a hugely important skill in the next decades.

    Data is indeed the new Intel Inside.

    Making Data Work: Practical Applications of Data Science (online conference)
    On June 9, 2010, O'Reilly Media will hold the Making Data Work online conference, with Mike Driscoll, Joe Adler, Hilary Mason, and Ben Fry speaking. Use the code md10rdx for a 40-percent discount.

    O'Reilly publications related to data science

    R in a Nutshell
    A quick and practical reference to learn what is becoming the standard for developing statistical software.

    Statistics in a Nutshell
    An introduction and reference for anyone with no previous background in statistics.

    Data Analysis with Open Source Tools

    This book shows you how to think about data and the results you want to achieve with it.

    Programming Collective Intelligence
    Learn how to build web applications that mine the data created by people on the Internet.

    Beautiful Data
    Learn from the best data practitioners in the field about how wide-ranging -- and beautiful -- working with data can be.

    Beautiful Visualization
    This book demonstrates why visualizations are beautiful not only for their aesthetic design, but also for elegant layers of detail.

    Head First Statistics
    This book teaches statistics through puzzles, stories, visual aids, and real-world examples.

    Head First Data Analysis
    Learn how to collect your data, sort the distractions from the truth, and find meaningful patterns.

    1 The NASA article denies this, but also says that in 1984, they decided that the low values (which went back to the 70s) were "real." Whether humans or software decided to ignore anomalous data, it appears that data was ignored.

    2 "Information Platforms as Dataspaces," by Jeff Hammerbacher (in Beautiful Data)

    3 "Information Platforms as Dataspaces," by Jeff Hammerbacher (in Beautiful Data)


    April 19 2010

    Big data analytics: From data scientists to business analysts

    The growing popularity of Big Data management tools (Hadoop; MPP, real-time SQL, NoSQL databases; and others1) means many more companies can handle large amounts of data. But how do companies analyze and mine their vast amounts of data? The cutting-edge (social) web companies employ teams of data scientists2 who comb through data using different Hadoop interfaces and use custom analysis and visualization tools. Other companies integrate their MPP databases with familiar Business Intelligence tools. For companies that already have large amounts of data in Hadoop, there's room for even simpler tools that would allow business users to directly interact with Big Data.

    A startup aims to expose Big Data to analysts charged with producing most routine reports. Datameer3 has an interesting workflow model that enables spreadsheet users to quickly perform analytics with data in Hadoop. The Datameer Analytics Solution (DAS) assumes data sits in Hadoop4, and from there a business analyst can rapidly load, transform, analyze, and visualize data:


    Datameer's workflow uses the familiar spreadsheet interface as a data processing pipeline. Random samples are pulled into worksheets where spreadsheet functions let analysts customize transformations, aggregations, and joins5. Once their analytic models are created, results are computed via Hadoop's distributed processing technology (computations are initiated through a simple GUI). DAS contains over a hundred standard spreadsheet functions, NLP tools (tokenization, ngrams) for unstructured data, and basic charting tools.
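Tokenization and ngrams, the NLP helpers mentioned above, are simple operations. Here's a minimal version of each -- this is an illustration of the concepts, not Datameer's API:

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Big Data analysis for business users")
print(ngrams(tokens, 2))
# Bigrams like ('big', 'data') are a common first step in turning
# unstructured text into something a spreadsheet function can count.
```

Counting ngram frequencies across a column of free text is a typical way such functions get used inside an analytics pipeline.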

    What's intriguing about DAS is that it opens up Big Data analysis to large sets of business users. Based on the private demo we saw last week, we think Datameer is off to a good start. While still in beta, DAS has been deployed by many customers and feedback from users has resulted in an intuitive and extremely useful analytic tool. With DAS, spreadsheet users will be able to perform Big Data analysis without assistance from their colleagues in IT.

    The buzz over Big Data has so far centered largely on (new) data management tools6. More recently, we're hearing from companies eager to tackle the next step: Big Data analysis ranging from routine reports to complex quantitative models. On one end, machine-learning algorithms and statistics are starting to appear as in-database analytic functions. At the other end, companies besides Datameer will develop Big Data analysis tools for average users (i.e., users who won't learn BI tools, SQL, Pig, Hive, and the like). If money isn't an issue, IBM's ambitious (and still immature) BigSheets project goes a step further than Datameer. It aims to provide data scientists with a single tool that can handle data acquisition (web crawlers), data management (Hadoop), text mining, and visualization (Many Eyes).

    (1) Splunk is a tool that does both Big Data management and analytics.
    (2) In fact, data scientist is a title that's increasingly used in companies like Yahoo!, Facebook, LinkedIn, Twitter, the NY Times, ...
    (3) Datameer is a San Mateo startup, with some engineers in Germany. The company name is based on the German word for ocean.
    (4) DAS can actually handle data from a variety of other sources, but for now, data from other sources gets pipelined to Hadoop in (near) real-time.
    (5) Spreadsheet users should quickly be able to merge data sources with DAS: joins are done between worksheets and are intuitive. DAS is a single tool that can handle data manipulation, analysis, and visualization, thus reducing the need to switch back and forth between multiple tools.
    (6) Along with the cool new data management tools, there are occasional stories of amazing custom analytics produced by data scientists.
