February 06 2012

Four short links: 6 February 2012

  1. Jirafe -- open source e-commerce analytics for Magento platform.
  2. iModela -- a $1000 3D milling machine. (via BoingBoing)
  3. It's Too Late to Save The Common Web (Robert Scoble) -- paraphrased: "Four years ago, I told you all that Google and Facebook were evil. You did nothing, which is why I must now use Google and Facebook." His list of reasons that Facebook beats the Open Web gives new shallows to the phrase "vanity metrics". Yes, the open web does not go out of its way to give you an inflated sense of popularity and importance. On the other hand, the things you do put there are in your control and will stay as long as you want them to. But that's obviously not a killer feature compared to a bottle of Astroglide and an autorefreshing page showing your Klout score and the number of Google+ circles you're in.
  4. iBooks Author EULA Clarified (MacObserver) -- important to note that it doesn't say you can't use the content you've written, only that you can't sell .ibook files through anyone but Apple. Less obnoxious than the "we own all your stuff, dude" interpretation, but still a bit crap. I wonder how anticompetitive this will be seen as. Apple's vertical integration is ripe for Justice Department investigation.

December 26 2011

Four short links: 26 December 2011

  1. Pattern -- a BSD-licensed bundle of Python tools for data retrieval, text analysis, and data visualization. If you were going to get started with accessible data (Twitter, Google), the fundamentals of analysis (entity extraction, clustering), and some basic visualizations of graph relationships, you could do a lot worse than to start here.
  2. Factorie (Google Code) -- Apache-licensed Scala library for a probabilistic modeling technique successfully applied to [...] named entity recognition, entity resolution, relation extraction, parsing, schema matching, ontology alignment, latent-variable generative models, including latent Dirichlet allocation. The state-of-the-art big data analysis tools are increasingly open source, presumably because the value lies in their application not in their existence. This is good news for everyone with a new application.
  3. Playtomic -- analytics as a service for gaming companies to learn what players actually do in their games. There aren't many fields untouched by analytics.
  4. Write or Die -- iPad app for writers where, if you don't keep writing, it begins to delete what you wrote earlier. Good for production to deadlines; reflective editing and deep thought not included.

November 08 2011

When good feedback leaves a bad impression

When a teacher is prone to hyperbole (lots of "greats!" and "excellents!" and "A+++" grades), it's natural for a student to perceive a mere "good" as an undesirable response. According to Panagiotis Ipeirotis, associate professor at New York University, the same perception applies to online reviews.

In a recent interview, Ipeirotis touched on the negative impact of good-enough reviews and a host of other data-related topics. Highlights from the interview (below) included:

  • Sentiment analysis is a commonly used tool for measuring what people are saying about a particular company or brand, but it has issues. "The problem with sentiment analysis," said Ipeirotis, "is that it tends to be rather generic, and it's not customized to the context in which people read." Ipeirotis pointed to Amazon as a good example here, where customer feedback about a merchant that says "good packaging" might initially appear as positive sentiment, but "good" feedback can have a negative effect on sales. "People tend to exaggerate a lot on Amazon. 'Excellent seller.' 'Super-duper service.' 'Lightning-fast delivery.' So when someone says 'good packaging,' it's perceived as, 'that's all you've got?'" [Discussed at the 0:42 mark.]
  • Ipeirotis suggested that people should challenge the initial conclusions they make from data. "Every time that something seems to confirm your intuition too much, I think it's good to ask for feedback." [Discussed at 2:24.]
  • Ipeirotis has done considerable research on Amazon's Mechanical Turk (MTurk) platform. He described MTurk as "an interesting example of a market that started with the wrong design." Amazon thought that its cloud-based labor service would be "yet another of its cloud services." But a market that "involves people who are strategic and responding to incentives," said Ipeirotis, "is very different than a market for CPUs and so on." Because Amazon didn't take this into consideration early on, the service has faced spam and reputation issues. Ipeirotis pointed to the site's use of anonymity as an example: Anonymity was supposed to protect privacy, but it's actually hurt some of the people who are good at what they do because anonymity is often associated with spammers. [Discussed at 2:55.]

The full interview is available in the following video:

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Some quotes from this interview were edited and condensed for clarity.


October 26 2011

Mobile analytics unlock the what and the when

When applied appropriately, mobile analytics reveal both what happened and when it happened. Case in point: "Let's say you have a game," said Flurry CTO Sean Byrnes (@FlurryMobile) during a recent interview. "You want to measure not just that someone got to level 1, 2, 3, or 4, but how long does it take for them to get to those levels? Does someone get to level 3 in one week, get to level 4 in two weeks, and get to level 5 in four weeks? Maybe those levels are too difficult. Or maybe a user is just getting tired of the same mechanic and you need to give them something different as the game progresses." [Discussed at the 2:21 mark.]

This is why a baseline metric, such as general engagement, deserves more than a passing glance. The specific engagements tucked within can unlock a host of improvements.
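
The time-to-level measurement Byrnes describes can be sketched in a few lines of Python. This is only an illustration, not Flurry's implementation: the event log, its `(user, level, date)` layout, and the function names are all made up for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical event log: (user_id, level_reached, date). Not real data.
EVENTS = [
    ("u1", 1, "2011-10-01"), ("u1", 2, "2011-10-03"), ("u1", 3, "2011-10-08"),
    ("u2", 1, "2011-10-02"), ("u2", 2, "2011-10-04"), ("u2", 3, "2011-10-18"),
    ("u3", 1, "2011-10-05"), ("u3", 2, "2011-10-06"),
]


def days_between_levels(events):
    """Return {level: [days each user needed to get there from the previous level]}."""
    reached = defaultdict(dict)  # user -> {level: date reached}
    for user, level, stamp in events:
        reached[user][level] = datetime.strptime(stamp, "%Y-%m-%d")

    gaps = defaultdict(list)
    for levels in reached.values():
        for level, when in levels.items():
            previous = levels.get(level - 1)
            if previous is not None:
                gaps[level].append((when - previous).days)
    return gaps


def median_days_per_level(events):
    """Summarize each level by the median number of days users took to reach it."""
    gaps = days_between_levels(events)
    return {level: median(days) for level, days in sorted(gaps.items())}


if __name__ == "__main__":
    for level, days in median_days_per_level(EVENTS).items():
        print(f"level {level}: median {days} days from the previous level")
```

A level whose median jumps well above its neighbors is the kind of signal Byrnes mentions: the level may be too hard, or players may be tiring of the mechanic.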

Byrnes touched on a number of related topics during the full interview (below), including:

  • Why mobile developers are focusing on engagement: Once you engage a user, do they stick around? If it costs you $1 to acquire a user, how much return will you get, if any? Byrnes said app engagement has grown in importance as developers have shifted their thinking from apps as marketing channels to apps as businesses. [Discussed 30 seconds in.]
  • Tablet apps vs smartphone apps: A tablet app isn't the same as a phone app. Flurry has found that tablet applications are being used "a number of times longer" than phone applications, but tablet consumers use fewer applications overall. [Discussed at 3:28.]

You can view the entire interview in the following video.


September 14 2011

Social data: A better way to track TV

Nielsen families, viewer diaries, and TV meters just won't cut it anymore. Divergent forms of television viewership require new audience measurement tools. Jodee Rich (@WingDude), CEO and founder of PeopleBrowsr, says social data is the key to new toolsets because it reveals both viewing behavior and sentiment.

Rich explores the connection between social data and television analytics in the following interview. He'll expand on these ideas during a presentation at next week's Strata Summit in New York.

Nielsen has been measuring audience response since the era of radio, yet the title of your Strata talk is "Move over, Nielsen." What is Nielsen's methodology, and why does it no longer suffice?

Jodee Rich: Nielsen data is sampled across the United States from approximately 20,000 households. Data is aggregated every night, sent back to Nielsen, and broken out by real-time viewings and same-day viewings.

There are two flaws in Nielsen's rating system that we can address with social analytics:

  1. Nielsen's method for classifying shows as "watched" — The Nielsen system does not demonstrate a show's popularity as much as it showcases which commercials viewers tune in for. If a person switches the channel to avoid commercials, the time spent watching that show is not tallied. The show is only counted as watched in full when the viewer is present for commercials.
  2. Nielsen ratings don't measure mediums other than television — The system does not take into account many of the common ways people now access shows, including Hulu, Netflix, on-demand, and iTunes.

How does social data provide more accurate ways of measuring audience response?

Jodee Rich: Social media offers opportunities to measure sentiment like never before. The volume of data available through social media outlets simply dwarfs Nielsen's sample base of 20,000 households. Millions of people form the social media user base, and naturally that base is more representative of the dynamics of an evolving demographic.

It's not just the volume, however. Social media values real-time engagement over passive participation. We can see not just what people are watching, but also monitor what they say about it. By observing actively engaged people, we can better discern who the viewers are, what they value, what they discuss, how often they talk about these things, and most importantly, how they feel about it. This knowledge allows brands to tailor messages with very high relevance.

Strata Summit New York 2011, being held Sept. 20-21, is for executives, entrepreneurs, and decision-makers looking to harness data. Hear from the pioneers who are succeeding with data-driven strategies, and discover the data opportunities that lie ahead.

Save 30% on registration with the code ORM30

How will these new measurement tools benefit viewers?

Jodee Rich: With social data, the television experience will be better catered to viewers. Broadcasters will enrich the viewing experience by creating flexible, responsive services that are sensitive to real people's tastes and conversations. We believe that ultimately this will make for more engaging entertainment and prolong the lives of the shows people love.

This interview was edited and condensed.

Photo: Solid State by skippyjon, on Flickr


September 02 2011

Top Stories: August 29-September 2, 2011

Here's a look at the top stories published across O'Reilly sites this week.

How to create sustainable open data projects with purpose
Tom Steinberg, head of the UK's civic-hacking non-profit mySociety, uses the launch of the new FixMyTransport to reflect on how organizations can help their open data efforts achieve sustainability.

Why the finance world should care about big data and data science
O'Reilly director of market research Roger Magoulas discusses the intersection of big data and finance, and the opportunities this pairing creates for financial experts.

When was the last time you mined your site's search data?
A gold mine is hiding in the data generated by website search engines, yet many site owners pay little attention to the analytics those engines yield. Author Lou Rosenfeld explains why site search is worth your time.

Subscription vs catchment
When people are trawling so many content sources, it no longer pays to concentrate on sources at all. It makes much more sense to study how the trawlers work and become part of the filtering infrastructure.

Government IT's quiet open source evolution
Packed halls at the 2011 Government Open Source Conference (GOSCON) confirmed that strong interest in open source runs throughout the federal IT community.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively. Save 30% on registration with the code ORM30.

August 29 2011

The application of real-time data

From her vantage point as chief scientist of Bitly, Hilary Mason has interesting insight into the real-time web and what people are sharing, posting, clicking and reading.

I recently spoke with Mason about Bitly's analysis and usage of real-time data. She'll be digging into these same topics at next month's Strata Conference in New York.

Our interview follows.

How does Bitly develop its data products and processes?

Hilary Mason: Our primary goal at Bitly is to understand what's happening on the Internet in real-time. We work by stating the problem we're trying to solve, brainstorming methods and models on the whiteboard, then experimenting on subsets of the data. Once we have a methodology in mind that we're fairly certain will work at scale, we build a prototype of the system, including data ingestion, storage, processing, and (usually) an API. Once we've proven it at that scale, we might decide to scale it to the full dataset or wait and see where it will plug into a product.

How does data drive Bitly's application of analytics and data science?

Hilary Mason: Bitly is a data-centric organization. The data informs business decisions, the potential of the product, and certainly our own internal processes. That said, it's important to draw a distinction between analytics and data science. Analytics is the measurement of well-understood metrics. Data science is the invention of new mathematical and algorithmic approaches to understanding the data. We do both, but apply them in very different ways.

What are the most important applications of real-time data?

Hilary Mason: The most important applications of real-time data apply to situations where having analysis immediately will change the outcome. More practically, when you can ask a question and get the answer before you've forgotten why you asked the question in the first place, it makes you massively more productive.

This interview was edited and condensed.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 30% on registration with the code STN11RAD


August 08 2011

Mobile metrics: Like the web, but a lot harder

Flurry is a mobile analytics company that currently tracks more than 10,000 Android applications and 500 million application sessions per day. Since Flurry supports all the major mobile platforms, the company is in a good position to comment on the relative merits and challenges of Android development.

I recently spoke with Sean Byrnes (@FlurryMobile), the CTO and co-founder of Flurry, about the state of mobile analytics, the strengths and weaknesses of Android, and how big a problem fragmentation is for the Android ecosystem. Byrnes will lead a session on "Android App Engagement by the Numbers" at the upcoming Android Open Conference.

Our interview follows.

What challenges do mobile platforms present for analytics?

Sean Byrnes: Mobile application analytics are interesting because the concepts are the same as traditional web analytics but the implementation is very different. For example, with web analytics you can always assume that a network connection is available since the user could not have loaded the web page otherwise. A large amount of mobile application usage happens when no network is available because either the cellular network is unreliable or there is no Wi-Fi nearby. As a result, analytics data has to be stored locally on the device and reported whenever a network is available, which can be weeks after the actual application session. This is complicated by the fact that about 10% of phones have bad clocks and report dates that in some cases are decades in the future.

Another challenge is that applications are downloaded onto the phone as opposed to a website where the content is all dynamic. This means that you have to think through all of the analytics you want to track for your application before you release it, because you can't change the tracking points once it's downloaded.
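
To make the store-and-report approach and the bad-clock problem concrete, here is a minimal Python sketch. It is not Flurry's implementation. The class and function names, the payload layout, and the one-day skew threshold are assumptions made for illustration, and the correction assumes the device clock is off by a constant amount between recording and upload.

```python
import json
import time

MAX_CLOCK_SKEW = 24 * 3600  # assumption: flag clocks that are off by more than a day


class OfflineEventQueue:
    """Buffers events on the device and uploads them once a network is available."""

    def __init__(self, send, clock=time.time):
        self._send = send    # callable taking a JSON string; raises OSError when offline
        self._clock = clock  # device clock, injectable so it can be faked in tests
        self._pending = []

    def record(self, name):
        self._pending.append({"event": name, "device_time": self._clock()})

    def flush(self):
        """Try to upload every pending event; keep them all if the upload fails."""
        if not self._pending:
            return 0
        payload = json.dumps({"sent_at": self._clock(), "events": self._pending})
        try:
            self._send(payload)
        except OSError:
            return 0  # still offline, try again later
        sent = len(self._pending)
        self._pending = []
        return sent


def correct_timestamps(payload, server_now):
    """Server side: shift device times by the clock error observed at upload time."""
    batch = json.loads(payload)
    offset = server_now - batch["sent_at"]  # device clock error, ignoring network latency
    suspect = abs(offset) > MAX_CLOCK_SKEW
    return [
        {"event": e["event"], "time": e["device_time"] + offset, "clock_suspect": suspect}
        for e in batch["events"]
    ]
```

Because the device stamps the upload with its own clock, the server can compare that stamp with its own time and shift every event in the batch by the difference, so a phone that thinks it is decades in the future still produces usable session times.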

What metrics are developers most interested in?

Sean Byrnes: Developers are typically focused on metrics that either make or save them the most money. Typically these revolve around retention, engagement and commerce.

One of the interesting things about the mobile application market is how the questions developers ask are changing because the market is maturing so quickly. Until recently the primary metrics used to measure applications were the number of daily active users and the total time spent in the app. These metrics help you understand the size of your audience and they're important when you're focusing on user acquisition. Now, the biggest questions being asked are centering on user retention and the lifetime value of a user. This is a natural shift in focus as applications begin to focus on profitability and developers need to understand how much money they make from users as compared to their acquisition cost. For example, if my application has 100,000 daily active users but an active user only uses the application once, then I have a high level of engagement that is very expensive to maintain.

Social game developers are among the most advanced, scientific and detailed measurement companies. Zynga, for example, measures every consumer action in impressive detail, all for the purpose of maximizing engagement and monetization. They consider themselves a data company as much as a game maker.
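
The shift Byrnes describes, from counting active users to comparing what a user is worth against what it cost to acquire them, comes down to simple arithmetic. The Python sketch below uses invented numbers and hypothetical function names. Treat it as one possible way to compute these figures, not as how Flurry or Zynga define them.

```python
# Hypothetical data: the day each user was first seen, the days they had a
# session, and the total revenue (in dollars) each user has generated.
FIRST_SEEN = {"a": 0, "b": 0, "c": 1, "d": 1}
SESSIONS = {"a": {0, 1, 7}, "b": {0}, "c": {1, 2}, "d": {1, 8}}
REVENUE = {"a": 3.5, "b": 0.0, "c": 0.5, "d": 0.5}


def retention_rate(first_seen, sessions, day):
    """Share of users with a session exactly `day` days after their first one."""
    returned = sum(
        1 for user, start in first_seen.items()
        if start + day in sessions.get(user, set())
    )
    return returned / len(first_seen)


def lifetime_value(revenue_by_user):
    """Average revenue per acquired user, whether or not they ever paid."""
    return sum(revenue_by_user.values()) / len(revenue_by_user)


def net_value_per_user(revenue_by_user, acquisition_cost):
    """What each acquired user returns after subtracting the cost of acquiring them."""
    return lifetime_value(revenue_by_user) - acquisition_cost


if __name__ == "__main__":
    print("day-1 retention:", retention_rate(FIRST_SEEN, SESSIONS, 1))
    print("day-7 retention:", retention_rate(FIRST_SEEN, SESSIONS, 7))
    print("net value at $1 per acquired user:", net_value_per_user(REVENUE, 1.0))
```

User "b" shows up in a daily-active-user count on day 0 but never returns and never pays, which is the expensive kind of audience Byrnes warns about.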

Android Open, being held October 9-11 in San Francisco, is a big-tent meeting ground for app and game developers, carriers, chip manufacturers, content creators, OEMs, researchers, entrepreneurs, VCs, and business leaders.

Save 20% on registration with the code AN11RAD

What do you see as Android's biggest strengths and weaknesses?

Sean Byrnes: Android's biggest strength is adoption and support by so many phone manufacturers, and the fact that it's relatively open, which means development iterations can happen faster and you have the freedom to experiment. The installed base is now growing faster than iOS, although the iOS installed base is still larger, as of now.

Android's biggest weakness is that it offers less business opportunity to developers. Fewer Android consumers have credit cards associated with their accounts, and they expect to get more free apps. The most common business model on Android is still built on ad revenue. Also, because the Android Market is not curated, you can end up with a lower-average-quality app, which further reduces consumer confidence. At the end of the day, developers want a business, not an OS. And they need a marketplace that brings in consumers who are willing and able to pay. This is Android's biggest challenge at the moment.

Is Android fragmentation a problem for developers?

Sean Byrnes: Fragmentation is definitely a problem for Android developers. It's not just the sheer number of new Android devices entering the market, but also the speed at which new versions of the Android OS are being released. As a developer you have to worry about handling both a variety of hardware capabilities and a variety of Android OS capabilities for every application version you release. It's difficult to be truly innovative and take advantage of advanced features since the path of least resistance is to build to the lowest common denominator.

It will likely get worse before it gets better with the current roadmap for Android OS updates and the number of devices coming to market. However, fragmentation will become less of a concern as developers can make more money on Android and the cost of supporting a device becomes insignificant compared to the revenue it can generate.


August 02 2011

Data and the human-machine connection

Arnab Gupta is the CEO of Opera Solutions, an international company offering big data analytics services. I had the chance to chat with him recently about the massive task of managing big data and how humans and machines intersect. Our interview follows.

Tell me a bit about your approach to big data analytics.

Arnab Gupta: Our company is a science-oriented company, and the core belief is that behavior, human or otherwise, can be mathematically expressed. Yes, people make irrational value judgments, but they are driven by common motivation factors, and the math expresses that.

I look at the so-called "big data phenomenon" as the instantiation of human experience. Previously, we could not quantitatively measure human experience, because the data wasn't being captured. But Twitter recently announced that they now serve 350 billion tweets a day. What we say and what we do has a physical manifestation now. Once there is a physical manifestation of a phenomenon, then it can be mathematically expressed. And if you can express it, then you can shape business ideas around it, whether that's in government or health care or business.

How do you handle rapidly increasing amounts of data?

Arnab Gupta: It's an impossible battle when you think about it. The amount of data is going to grow exponentially every day, every week, every year, so capturing it all can't be done. In the economic ecosystem there is extraordinary waste. Companies spend vast amounts of money, and the ratio of investment to insight is growing, with much more investment for similar levels of insight. This method just mathematically cannot work.

So, we don't look for data, we look for signal. What we've said is that the shortcut is a priori identifying the signals to know where the fish are swimming, instead of trying to dam the water to find out which fish are in it. We focus on the flow, not a static data capture.

What role does visualization play in the search for signal?

Arnab Gupta: Visualization is essential. People dumb it down sometimes by calling it "UI" and "dashboards," and they don't apply science to the question of how people perceive. We need understanding that feeds into the left brain through the right brain via visual metaphor. At Opera Solutions, we are increasingly trying to figure out the ways in which the mind understands and transforms the visualization of algorithms and data into insights.

If understanding is a priority, then which do you prefer: a black-box model with better predictability, or a transparent model that may be less accurate?

Arnab Gupta: People bifurcate, and think in terms of black-box machines vs. the human mind. But the question is whether you can use machine learning to feed human insight. The power lies in expressing the black box and making it transparent. You do this by stress testing it. For example, if you were looking at a model for mortgage defaults, you would say, "What happens if home prices went down by X percent, or interest rates go up by X percent?" You make your own heuristics, so that when you make a bet you understand exactly how the machine is informing your bet.

Humans can do analysis very well, but the machine does it consistently well; it doesn't make mistakes. What the machine lacks is the ability to consider orthogonal factors, and the creativity to consider what could be. The human mind fills in those gaps and enhances the power of the machine's solution.
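
The stress test Gupta describes can be sketched by treating a model as a function and re-scoring it under shocked inputs. The toy logistic model below is purely illustrative: its coefficients are invented, not fitted to any data, and nothing here reflects Opera Solutions' actual models.

```python
import math

# Hypothetical coefficients for a toy mortgage-default model (not fitted to data).
COEFFS = {
    "intercept": -3.0,
    "home_price_change_pct": -0.08,  # falling prices raise the default score
    "interest_rate_pct": 0.25,       # higher rates raise the default score
}


def default_probability(home_price_change_pct, interest_rate_pct):
    """Logistic score standing in for the black-box model."""
    score = (
        COEFFS["intercept"]
        + COEFFS["home_price_change_pct"] * home_price_change_pct
        + COEFFS["interest_rate_pct"] * interest_rate_pct
    )
    return 1.0 / (1.0 + math.exp(-score))


def stress_test(baseline, shocks):
    """Re-score the model under each named shock and report the change from baseline."""
    base = default_probability(**baseline)
    report = {}
    for name, changes in shocks.items():
        scenario = dict(baseline)
        for key, delta in changes.items():
            scenario[key] += delta
        report[name] = default_probability(**scenario) - base
    return base, report


if __name__ == "__main__":
    baseline = {"home_price_change_pct": 0.0, "interest_rate_pct": 5.0}
    shocks = {
        "prices_down_10": {"home_price_change_pct": -10.0},
        "rates_up_2": {"interest_rate_pct": 2.0},
    }
    base, report = stress_test(baseline, shocks)
    print(f"baseline default probability: {base:.3f}")
    for name, change in report.items():
        print(f"{name}: {change:+.3f}")
```

The same loop works with any model that can be called as a function, which is the point of the exercise: the analyst learns how the bet moves without needing to read the model's internals.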

So you advocate a partnership between the model and the data scientist?

Arnab Gupta: We often create false dichotomies for ourselves, but the truth is it's never been man vs. machine; it has always been man plus machine. Increasingly, I think it's an article of faith that the machine beats the human in most large-scale problems, even chess. But though the predictive power of machines may be better on a large-scale basis, if the human mind is trained to use it powerfully, the possibilities are limitless. In the recent Jeopardy showdown with IBM's Watson, I would have had a three-way competition with Watson, a Jeopardy champion, and a combination of the two. Then you would have seen where the future lies.

Does this mean we need to change our approach to education, and train people to use machines differently?

Arnab Gupta: Absolutely. If you look back in time between now and the 1850s, everything in the world has changed except the classroom. But I think we are dealing with a phase-shift occurring. Like most things, the inertia of power is very hard to shift. Change can take a long time and there will be a lot of debris in the process.

One major hurdle is that the language of machine-plus-human interaction has not yet begun to be developed. It's partly a silent language, with data visualization as a significant key. The trouble is that language is so powerful that the left brain easily starts dominating, but really almost all of our critical inputs come from non-verbal signals. We have no way of creating a new form of language to describe these things yet. We are at the beginning of trying to develop this.

Another open question is: What's the skill set and the capabilities necessary for this? At Opera we have focused on the ability to teach machines how to learn. We have 150-160 people working in that area, which is probably the largest private concentration in that area outside IBM and Google. One of the reasons we are hiring all these scientists is to try to innovate at the level of core competencies and the science of comprehension.

The business outcome of that is simply practical. At the end of the day, much of what we do is prosaic; it makes money or it doesn't make money. It's a business. But the philosophical fountain from which we drink needs to be a deep one.


July 25 2011

How data and analytics can improve education

Schools have long amassed data: tracking grades, attendance, textbook purchases, test scores, cafeteria meals, and the like. But little has actually been done with this information — whether due to privacy issues or technical capacities — to enhance students' learning.

With the adoption of technology in more schools and with a push for more open government data, there are clearly a lot of opportunities for better data gathering and analysis in education. But what will that look like? It's a politically charged question, no doubt, as some states are turning to things like standardized test score data in order to gauge teacher effectiveness and, in turn, retention and promotion.

I asked education theorist George Siemens, from the Technology Enhanced Knowledge Research Institute at Athabasca University, about the possibilities and challenges for data, teaching, and learning.

Our interview follows.

What kinds of data have schools traditionally tracked?

George Siemens: Schools and universities have long tracked a broad range of learner data — often drawn from applications (universities) or enrollment forms (schools). This data includes any combination of: location, previous learning activities, health concerns (physical and emotional/mental), attendance, grades, socio-economic data (parental income), parental status, and so on. Most universities will store and aggregate this data under the umbrella of institutional statistics.

Privacy laws differ from country to country, but generally will prohibit academics from accessing data that is not relevant to a particular class, course, or program. Unfortunately, most schools and universities do very little with this wealth of data, other than possibly producing an annual institutional profile report. Even a simple analysis of existing institutional data could raise the profile of potential at-risk students or reveal attendance or assignment submission patterns that indicate the need for additional support.

What new types of educational data can now be captured and mined?

George Siemens: In terms of learning analytics or educational data-mining, the growing externalization of learning activity (i.e. capturing how learners interact with content and the discourse they have around learning materials as well as the social networks they form in the process) is driven by the increased attention to online learning. For example, a learning management system like Moodle or Desire2Learn captures a significant amount of data, including time spent on a resource, frequency of posting, number of logins, etc. This data is fairly similar to what Google Analytics or Piwik collects regarding website traffic. A new generation of tools, such as SNAPP, uses this data to analyze social networks, degrees of connectivity, and peripheral learners. Discourse analysis tools, such as those being developed at the Knowledge Media Institute at the Open University, UK, are also effective at evaluating the qualitative attributes of discourse and discussions and rate each learner's contributions by depth and substance in relation to the topic of discussion.
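
As a rough illustration of the kind of network analysis Siemens mentions, the Python sketch below counts how many distinct peers each learner has exchanged forum replies with and flags those on the periphery. It is not SNAPP's algorithm. The reply log, the names, and the threshold are hypothetical, and a real analysis would draw on the LMS's own export format.

```python
from collections import defaultdict

# Hypothetical forum replies exported from an LMS: (author, replied_to).
REPLIES = [
    ("ana", "ben"), ("ben", "ana"), ("cho", "ana"),
    ("ben", "cho"), ("dev", "ana"), ("ana", "cho"),
]
ENROLLED = ["ana", "ben", "cho", "dev", "eli"]


def connection_counts(replies, learners):
    """Count the distinct peers each enrolled learner has exchanged replies with."""
    peers = defaultdict(set)
    for author, target in replies:
        if author != target:  # ignore self-replies
            peers[author].add(target)
            peers[target].add(author)
    return {name: len(peers[name]) for name in learners}


def peripheral_learners(replies, learners, max_connections=1):
    """Learners with at most `max_connections` peers (an assumed, arbitrary cutoff)."""
    counts = connection_counts(replies, learners)
    return [name for name in learners if counts[name] <= max_connections]


if __name__ == "__main__":
    print(connection_counts(REPLIES, ENROLLED))
    print("peripheral:", peripheral_learners(REPLIES, ENROLLED))
```

Starting from the enrollment list rather than the reply log matters here: a learner who never posts would otherwise be invisible, and those are the learners this kind of analysis is meant to surface.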

An area of data gathering that universities and schools are largely overlooking relates to the distributed social interactions learners engage in on a daily basis through Facebook, blogs, Twitter, and similar tools. Of course, privacy issues are significant here. However, as we are researching at Athabasca University, social networks can provide valuable insight into how connected learners are to each other and to the university. Potential models are already being developed on the web that would translate well to school settings. For example, Klout measures influence within a network and Radian6 tracks discussions in distributed networks.

The existing data gathering in schools and universities pales in comparison to the value of data mining and learning analytics opportunities that exist in the distributed social and informational networks that we all participate in on a daily basis. It is here, I think, that most of the novel insights on learning and knowledge growth will occur. When we interact in a learning management system (LMS), we do so purposefully — to learn or to complete an assignment. Our interaction in distributed systems is more "authentic" and can yield novel insights into how we are connected, our sentiments, and our needs in relation to learning success. The challenge, of course, is how to balance concerns of the Hawthorne effect with privacy.

Discussions about data ownership and privacy lag well behind what is happening in learning analytics. Who owns learner-produced data? Who owns the analysis of that data? Who gets to see the results of analysis? How much should learners know about the data being collected and analyzed?

I believe that learners should have access to the same dashboard for analytics that educators and institutions see. Analytics can be a powerful tool in learner motivation — how do I compare to others in this class? How am I doing against the progress goals that I set? If data and analytics are going to be used for decision making in teaching and learning, then we need to have important conversations about who sees what and what are the power structures created by the rules we impose on data and analytics access.

How can analytics change education?

George Siemens: Education is, today at least, a black box. Society invests significantly in primary, secondary, and higher education. Unfortunately, we don't really know how our inputs influence or produce outputs. We don't know, precisely, which academic practices need to be curbed and which need to be encouraged. We are essentially swatting flies with a sledgehammer and doing a fair amount of peripheral damage.

Learning analytics are a foundational tool for informed change in education. Over the past decade, calls for educational reform have increased, but very little is understood about how the system of education will be impacted by the proposed reforms. I sometimes fear that the solution being proposed to what ails education will be worse than the current problem. We need a means, a foundation, on which to base reform activities. In the corporate sector, business intelligence serves this "decision foundation" role. In education, I believe learning analytics will serve this role. Once we better understand the learning process — the inputs, the outputs, the factors that contribute to learner success — then we can start to make informed decisions that are supported by evidence.

However, we have to walk a fine line in the use of learning analytics. On the one hand, analytics can provide valuable insight into the factors that influence learners' success (time on task, attendance, frequency of logins, position within a social network, frequency of contact with faculty members or teachers). Peripheral data analysis could include the use of physical services in a school or university: access to library resources and learning help services. On the other hand, analytics can't capture the softer elements of learning, such as the motivating encouragement from a teacher and the value of informal social interactions. In any assessment system, whether standardized testing or learning analytics, there is a real danger that the target becomes the object of learning, rather than the assessment of learning.

With that as a caveat, I believe learning analytics can provide dramatic, structural change in education. For example, today, our learning content is created in advance of the learners taking a course in the form of curriculum like textbooks. This process is terribly inefficient. Each learner has differing levels of knowledge when they start a course. An intelligent curriculum should adjust and adapt to the needs of each learner. We don't need one course for 30 learners; each learner should have her own course based on her life experiences, learning pace, and familiarity with the topic. The content in the courses that we take should be adaptive, flexible, and continually updated. The black box of education needs to be opened and adapted to the requirements of each individual learner.

In terms of evaluation of learners, assessment should be in-process, not at the conclusion of a course in the form of an exam or a test. Let's say we develop semantically-defined learning materials and ways to automatically compare learner-produced artifacts (in discussions, texts, papers) to the knowledge structure of a field. Our knowledge profile could then reflect how we compare to the knowledge architecture of a domain — i.e. "you are 64% on your way to being a psychologist" or "you are 38% on your way to being a statistician." Basically, evaluation should be done based on a complete profile of an individual, not only the individual in relation to a narrowly defined subject area.
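One crude way to picture the "64% on your way" idea is as coverage of a field's concept list. The sketch below is only an illustration: the concept names, the flat-set representation of a field, and the `coverage_percent` helper are all assumptions, and a real comparison against semantically defined materials would be far richer than set overlap.

```python
def coverage_percent(demonstrated: set[str], field_concepts: set[str]) -> float:
    """Percent of a field's concepts that appear in the learner's demonstrated set."""
    if not field_concepts:
        raise ValueError("field_concepts must not be empty")
    covered = demonstrated & field_concepts  # concepts the learner has shown
    return 100.0 * len(covered) / len(field_concepts)


# Hypothetical concept lists, invented for this example.
statistics_concepts = {
    "probability", "sampling", "regression", "inference", "experimental design",
}
learner_concepts = {"probability", "regression", "essay writing"}

score = coverage_percent(learner_concepts, statistics_concepts)
print(f"{score:.0f}% of the way to being a statistician")
```

Concepts outside the field (here, "essay writing") do not count against the learner; they would simply show up in the profile for some other domain.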

Programs of study should also include non-school-related learning (prior learning assessment). A student who volunteers with a local charity or a student who plays sports outside of school is acquiring skills and knowledge that are currently ignored by the school system. "Whole-person analytics" is required where we move beyond the micro-focus of exams. For students who return to university mid-career to gain additional qualifications, recognition for non-academic learning is particularly important.

Much of the current focus on analytics relates to reducing attrition or student dropouts. This is the low-hanging fruit of analytics. An analysis of the signals learners generate (or fail to generate, such as when they don't log in to a course) can provide early indications of which students are at risk for dropping out. By recognizing these students and offering early interventions, schools can reduce dropouts dramatically.
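A minimal sketch of one such signal, login inactivity, might look like the following. The student identifiers, dates, and the 14-day threshold are all hypothetical, and a real early-warning system would combine many more signals than this single one.

```python
from datetime import date


def at_risk(last_logins: dict[str, date], today: date, max_gap_days: int = 14) -> list[str]:
    """Return students whose most recent login is more than max_gap_days ago."""
    flagged = [
        student
        for student, last_seen in last_logins.items()
        if (today - last_seen).days > max_gap_days
    ]
    return sorted(flagged)


# Hypothetical login records, invented for this example.
last_logins = {
    "student_a": date(2011, 9, 1),
    "student_b": date(2011, 9, 19),
    "student_c": date(2011, 8, 20),
}

flagged = at_risk(last_logins, today=date(2011, 9, 20))
print(flagged)  # ['student_a', 'student_c']
```

The threshold is the main design choice: too short and the list fills with students who are merely busy, too long and the intervention comes too late.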

All of this is to say that learning analytics serve as a foundation for informed change in education, altering how schools and universities create curriculum, deliver it, assess student learning, provide learning support, and even allocate resources.

What technologies are behind learning analytics?

George Siemens: Some of the developments in learning analytics track the development of the web as a whole — including the use of recommender systems, social network analysis, personalization, and adaptive content. We are at an exciting cross-over point between innovations in the technology space and research in university research labs. Language recognition, artificial intelligence, machine learning, neural networks, and related concepts are being combined with the growth of social network services, collaborative learning, and participatory pedagogy.

The combination of technical and social innovations in learning offers huge potential for a better, more effective learning model. Together with Stephen Downes and Dave Cormier, I've experimented with "massive open online courses" over the past four years. This experimentation has resulted in software that we've developed to encourage distributed learning, while still providing a loose level of aggregation that enables analytics. Tools like Open Study take a similar approach: decentralized learning, centralized analytics. Companies like Grockit and Knewton are creating personalized adaptive learning platforms. Not to be outdone, traditional publishers like Pearson and McGraw-Hill are investing heavily in adaptive learning content and are starting to partner with universities and schools to deliver the content and even evaluate learner performance. Learning management system providers (such as Desire2Learn and Blackboard) are actively building analytics options into their offerings.

Essentially, in order for learning analytics to have a broad impact in education, the focus needs to move well beyond basic analytics techniques such as those found in Google Analytics. An integrated learning and knowledge model is required where the learning content is adaptive, prior learning is included in assessment, and learning resources are provided in various contexts (e.g. "in class today you studied Ancient Roman laws, two blocks from where you are now, a museum is holding a special exhibit on Roman culture"). The profile of the learner, not pre-planned content, needs to drive curriculum and learning opportunities.

What are the major obstacles facing education data and analytics?

George Siemens: In spite of the enormous potential they hold to improve education, learning analytics are not without concerns. Privacy for learners and teachers is a critical issue. While I see analytics as a means to improve learner success, opportunities exist to use analytics to evaluate and critique the performance of teachers. Data access and ownership are equally important issues: who should be able to see the analysis that schools perform on learners? Other concerns relate to error-correction in analytics. If educators rely heavily on analytics, effort should be devoted to evaluating the analytics models and understanding in which contexts those analytics are not valid.

With regard to the adoption of learning analytics, now is an exceptionally practical time to explore analytics. The complex challenges that schools and universities face can, at least partially, be illuminated through analytics applications.


July 12 2011

Four short links: 12 July 2011

  1. Slopegraphs -- a nifty Tufte visualization which conveys rank, value, and delta over time. Includes pointers to how to make them, and guidelines for when and how they work. (via Avi Bryant)
  2. Ask Me Anything: A Technical Lead on the Google+ Team -- lots of juicy details about technology and dev process. "A couple nifty tricks we do: we use the HTML5 History API to maintain pretty-looking URLs even though it's an AJAX app (falling back on hash-fragments for older browsers); and we often render our Closure templates server-side so the page renders before any JavaScript is loaded, then the JavaScript finds the right DOM nodes and hooks up event handlers, etc. to make it responsive (as a result, if you're on a slow connection and you click on stuff really fast, you may notice a lag before it does anything, but luckily most people don't run into this in practice)." (via Nahum Wild)
  3. scalang (github) -- a Scala wrapper that makes it easy to interface with Erlang, so you can use two hipster-compliant built-to-scale technologies in the same project. (via Justin Sheehy)
  4. Madlib -- an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. (via Mike Loukides)

July 07 2011

Strata Week: How much of the web is archived?

Here are a few of the data stories that caught my attention this week.

How much of the web is archived?

Researchers at Old Dominion University are trying to ascertain how much of the web has actually been archived and preserved in various databases. Scott Ainsworth, Ahmed Alsum, Hany SalahEldeen, Michele Weigle, and Michael Nelson have published a paper (PDF) with their analysis of the current state of archiving.

The researchers have studied sample URIs from DMOZ, Delicious, bit.ly, and search engine indexes in order to measure the number of archive copies available in various public web archives. According to their findings, between 35% and 90% of URIs have at least one archived copy. That's a huge range, and when you look at DMOZ, for example, you'll find a far higher rate of archiving than you will for bit.ly links. That's hardly surprising, of course, as DMOZ is a primary source for the Internet Archive's efforts.

More troubling, perhaps, is the poor representation of bit.ly links in archiving. The researchers say that the reason for this isn't entirely clear. Nonetheless, we should consider: what are the implications of this as bit.ly and other URL shorteners become increasingly utilized?
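The headline measure here, the share of sampled URIs with at least one archived copy, is a simple ratio. The sketch below shows the arithmetic only; the URIs and copy counts are made up and are not the researchers' data.

```python
def archived_share(copies_per_uri: dict[str, int]) -> float:
    """Percent of sampled URIs that have at least one archived copy."""
    if not copies_per_uri:
        raise ValueError("sample must not be empty")
    archived = sum(1 for copies in copies_per_uri.values() if copies >= 1)
    return 100.0 * archived / len(copies_per_uri)


# Hypothetical sample: URI -> number of archived copies found.
sample = {
    "http://example.com/a": 0,
    "http://example.com/b": 3,
    "http://example.com/c": 1,
    "http://example.com/d": 0,
}

share = archived_share(sample)
print(f"{share:.0f}% of sampled URIs have at least one archived copy")
```

Running the same function over a separate sample for each source is what produces a range of figures rather than a single number.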

In an article in The Chronicle, Alexis Rossi from the Internet Archive points out how this project helps to shed light on the web archiving process. "People are coming to the realization that if nobody saves the Internet, their work will just be gone."


Better healthcare data collection

The Department of Health and Human Services announced last week that it would work to improve its collection of healthcare data, specifically around the collection and reporting of race, ethnicity, sex, primary language and disability status. The department also announced that it plans to collect health data about lesbian, gay, bisexual, and transgender populations.

Feministing says this is a big deal — and indeed, it does mark an important move to help uncover some of the disparities in healthcare that many in the LGBT community know exist:

There is a lack of data on LGBT folks, who we do know face disparities in health and access to health services. Without federal health data, it's practically impossible to direct federal government resources to focus on health inequalities. Including sexual orientation in data collection will go a long way towards showing what LGB folks face. This data will make it possible to name and quantify real world problems, and to then direct government resources towards addressing them.

The Department of Health and Human Services positions the move for better data collection as part of its efforts to help address healthcare inequality. According to HHS Secretary Kathleen Sebelius: "The first step is to make sure we are asking the right questions. Sound data collection takes careful planning to ensure that accurate and actionable data is being recorded."

Twitter acquires BackType for better data analytics

Analytics company BackType announced this week that it had been acquired by Twitter. BackType offers its customers the ability to track their social media impact across the web. Using BackType, you could get an RSS feed of all the comments posted across the blogosphere that were signed with a certain website's URL, providing a powerful tool by which you could track people's commentary and participation online.

With the acquisition, BackType says that it will bring its analytics platform to help develop "tools for Twitter's publisher partners." No doubt, these sorts of analytics are a key piece of Twitter's value proposition, and it's becoming increasingly clear that the company is opting to bring these sorts of tools in-house, rather than relying on third-party vendors to supply the analytics tools through which people can gauge participation and interest in various pieces of content.

Got data news?

Feel free to email me.


July 05 2011

Search Notes: Why Google's Social Analytics tools matter

The big search news over the past week has been the launch of Google Plus, but lots of other stuff has been going on as well. Read on for the rundown.

Google social analytics

Plus isn't the only social launch Google had recently. The company also pushed out social analytics features in both Google Analytics and Google Webmaster Tools.

If you use the new version of Google Analytics, you'll now see a social engagement report. Use the social plugin to configure your site for different social media platforms to monitor the behavior of visitors coming from those platforms. Do those coming from Twitter convert better than those coming from Facebook? Do those who "+1" a page spend more time on it? Those are the sorts of questions the new social reports aim to answer.

You can also use Google Webmaster Tools to see how +1 activity is impacting how searchers interact with your pages in search results. In particular, you can see if the click-through rate of a result improves when it includes +1 annotations.
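The comparison described here comes down to two click-through rates. The sketch below shows the arithmetic with invented numbers; it does not use any Google Webmaster Tools API, and the variable names are assumptions for illustration.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Clicks divided by impressions, as a fraction between 0 and 1."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


# Hypothetical figures for the same result shown with and without +1 annotations.
ctr_with_annotations = click_through_rate(clicks=45, impressions=1000)
ctr_without_annotations = click_through_rate(clicks=30, impressions=1000)

# Relative lift: how much higher the annotated rate is, as a share of the baseline.
relative_lift = (ctr_with_annotations - ctr_without_annotations) / ctr_without_annotations
print(f"CTR with annotations:    {ctr_with_annotations:.1%}")
print(f"CTR without annotations: {ctr_without_annotations:.1%}")
print(f"Relative lift:           {relative_lift:.0%}")
```

A higher rate for annotated results does not by itself show that the annotation caused the extra clicks, since annotated results may also differ in ranking or audience.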

This is just one example of how the silos of the web are integrating. You shouldn't think of "social" users and "search" users when you are doing audience analysis for your site. You instead have one audience who may be coming to your site any number of ways. Engaging in social media can help your site be more visible in search, as results become more personalized and pages that our friends have shared, liked, and "plussed" show up more often for us.

Some may wonder if integrations like this mean that Google is weighting social signals more strongly in search. But those kinds of questions miss the point. The specific signals will continue to change, but the important thing is to engage your audiences wherever they are. The lines will continue to blur.

Google Realtime Search goes offline "temporarily"

A few days ago, Google's realtime search mysteriously disappeared. The reason: Google's agreement with Twitter expired and Google is now working on a new system to display realtime information. While this has temporarily impacted a number of results pages (such as top shared links and top tweets on Google News), it has not impacted Google's social results, which show results that your friends have shared.

New Google UI

Google launched the first of many user interface updates last week, with the promise of many more changes to follow throughout the summer.

Google, Twitter and the FTC

But the Google world is not just about launches. The FTC formally notified Google that they are reviewing the business. Google says that they are "unclear exactly what the FTC's concerns are" but that they "focus on the user [and] all else will follow."

The Wall Street Journal reports that the investigation focuses on Google's core search advertising business, including "whether Google searches unfairly steer users to the company's own growing network of services at the expense of rival providers."

The FTC may also be investigating Twitter, due to how Twitter may be acquiring applications.

Android Open, being held October 9-11 in San Francisco, is a big-tent meeting ground for app and game developers, carriers, chip manufacturers, content creators, OEMs, researchers, entrepreneurs, VCs, and business leaders.

Save 20% on registration with the code AN11RAD

Google Plus (or is it +?)

And of course we have to dig into that well-chronicled launch. As you're no doubt aware, Google launched their latest social effort last week: Google+. Or Google Plus. Or Plus. Or +. I don't know. But it's different from Plus One (+1?). Also it's not Wave, Buzz, Social Circles. Or Facebook.

I've just started using it, so I don't have a verdict on it yet, although I don't know that I buy into Google's premise that "online sharing is awkward. Even broken." And that Google Plus will fix that. It doesn't mean I won't like the product, either. Google is of course under more scrutiny than usual since earlier social launches haven't gone over as well as they'd have liked. What do you all think of it?

Lots of sites have done comprehensive run downs, including:

(Google's Joseph Smarr, a member of the Google+ team, will discuss the future of the social web at OSCON. Save 20% on registration with the code OS11RAD.)

Yahoo search BOSS updates

Yahoo launched updates to their BOSS (Build your own search service) program. If you're a developer who uses Yahoo BOSS, you might be interested in the changes.

Schema.org and rel=author

A few weeks ago, Google, Microsoft, and Yahoo launched the schema.org alliance, which provides joint support for 100+ microdata formats. At the same time, Google announced support for rel=author, which enables site owners to provide structured markup on a page that specifies the author of the content.

The schema.org announcement seems to be a foundational announcement to encourage platform providers, such as content management system creators, to build in support of microdata formats for future use by the search engines.

On the other hand, Google has already launched integration of rel=author with search results. You can see examples of how this looks with results for the initial set of authors Google is working with.


June 06 2011

Google Correlate: Your data, Google's computing power

Google Correlate is awesome. As I noted in Search Notes last week, Google Correlate is a new tool in Google Labs that lets you upload state- or time-based data to see what search trends most correlate with that information.

Correlation doesn't necessarily imply causation, and as you use Google Correlate, you'll find that the relationship (if any) between terms varies widely based on the topic, time, and space.

For instance, there's a strong state-based correlation between searches for me and searches for Vulcan Capital. But the two searches have nothing to do with each other. As you see below, the correlation is that the two searches have similar state-based interest.

Picture 476.png

For both searches, the most volume is in Washington state (where we're both located). And both show high activity in New York.

State-based data

For a recent talk I gave in Germany, I downloaded state-by-state income data from the U.S. Census Bureau and ran it through Google Correlate. I found that high income was highly correlated with searches for [lohan breasts] and low income was highly correlated with searches for [police shootouts]. I leave the interpretation up to you.

Picture 443.png

Picture 445.png

By default, the closest correlations are with the highest numbers, so to get correlations with low income, I multiplied all of the numbers by negative one.
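The negative-one trick works because negating one series flips the sign of its correlation with any other series, so the strongest negative match becomes the strongest positive one. The sketch below demonstrates this with a hand-written Pearson correlation; using Pearson is an assumption made here for illustration, and the numbers are invented, not Census or Google data.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-state values: income (thousands) and volume for some query.
income = [40.0, 55.0, 62.0, 48.0]
query_volume = [9.0, 4.0, 2.0, 7.0]

r = pearson(income, query_volume)
r_negated = pearson([-value for value in income], query_volume)
print(f"income vs. query:  {r:+.3f}")
print(f"-income vs. query: {r_negated:+.3f}")
```

In this made-up data the query is searched more where income is lower, so the first coefficient is negative and the second is the same size with the sign flipped.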

Clay Johnson looked at correlations based on state obesity rates from the CDC. By looking at negative correlations (in other words, what search queries are most closely correlated with states with the lowest obesity rates), we see that the most closely related search is [yoga mat bags]. (Another highly correlated term is [nutrition school].)

Picture 478.png

Maybe there's something to that "working out helps you lose weight" idea I've heard people mention. Then again, another highly correlated term is [itunes movie rentals], so maybe I should try the "sitting on my couch, watching movies work out plan" just to explore all of my options.

To look at this data more seriously, we can see with search data alone that the wealthy seem to be healthier (at least based on obesity data) than the poor. In states with low obesity rates, searches are for optional material goods, such as Bose headphones, digital cameras, and red wine and for travel to places like Africa, Jordan, and China. In states with high obesity rates, searches are for jobs and free items.

With this hypothesis, we can look at other data (access to nutritious food, time and space to exercise, health education) to determine further links.

Time-based data

Time-based data works in a similar way. Google Correlate looks for matching patterns in trends over time. Again, that the trends are similar doesn't mean they're related. But this data can be an interesting starting point for additional investigation.

One of the economic indicators from the U.S. Census Bureau is housing inventory. I looked at the number of months' supply of homes at the current sales rate between 2003 and today. I have no idea how to interpret data like this (the general idea is that you, as an expert in some field, would upload data that you understand). But my non-expert conclusion here is that as housing inventory increases (which implies no one's buying), we are looking to spiff up our existing homes with cheap stuff, so we turn to Craigslist.

Picture 481.png

Picture 482.png

Picture 483.png

Of course, it could also be the case that the height of popularity of Craigslist just happened to coincide with the months when the most homes were on the market, and both are coincidentally declining at the same rate.

Search-based data

You can also simply enter a search term, and Google will analyze the state or time-based patterns of that term and chart other queries that most closely match those patterns. Google describes this as a kind of Google Trends in reverse.

Google Insights for Search already shows you state distribution and volume trends for terms, and Correlate takes this one step further by listing all of the other terms with a similar regional distribution or volume trend.

For instance, regional distribution for [vegan restaurants] searches is strongly correlated to the regional distribution for searches for [mac store locations].

Picture 484.png

What does the time-trend of search volume for [vegan restaurants] correlate with? Flights from LAX.

Picture 485.png

Time-based data related to a search term can be a fascinating look at how trends spark interest in particular topics. For instance, as the Atkins Diet lost popularity, so too did interest in the carbohydrate content of food.

Picture 486.png

Interest in maple syrup seems to follow interest in the cleanse diet (of which maple syrup is a key component).

Picture 488.png

Drawing-based data

Don't have any interesting data to upload? Aren't sure what topic you're most interested in? Then just draw a graph!

Maybe you want to know what had no search volume at all in 2004, spiked in 2005, and then disappeared again. Easy. Just draw it on a graph.

Picture 489.png

Apparently the popular movies of the time were "Phantom of the Opera," "Darkness," and "Meet the Fockers." And we all were worried about our Celebrex prescriptions.

Picture 490.png

Picture 491.png

(Note: the accuracy of this data likely is dependent on the quality of your drawing skills.)
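Matching a drawn curve can be pictured as ranking candidate time series by how closely their shape follows the drawing. The sketch below does this with Pearson correlation, which is an assumption made for illustration rather than a description of how Google Correlate is implemented; the term names and series are invented.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def best_matches(drawn: list[float], candidates: dict[str, list[float]], top: int = 3):
    """Rank candidate series by correlation with the drawn curve, best first."""
    scored = [(pearson(drawn, series), term) for term, series in candidates.items()]
    scored.sort(reverse=True)
    return [(term, score) for score, term in scored[:top]]


# A hand-drawn "flat, spike, flat" shape and three hypothetical query series.
drawn = [0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0]
candidates = {
    "spike_term": [1.0, 1.0, 3.0, 11.0, 3.0, 1.0, 1.0],
    "steady_term": [5.0, 5.0, 6.0, 5.0, 6.0, 5.0, 5.0],
    "declining_term": [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
}

for term, score in best_matches(drawn, candidates):
    print(f"{term}: {score:+.3f}")
```

Because correlation ignores scale and offset, a sloppy drawing with the right overall shape can still find the spike, which is also why rough drawings can produce surprising matches.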

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD


March 31 2011

Outliers and coexistence are the new normal for big data

Letting data speak for itself through analysis of entire data sets is eclipsing modeling from subsets. In the past, all too often, what were disregarded as "outliers" on the far edges of a data model turned out to be the telltale signs of a micro-trend that became a major event. To enable this kind of advanced analytics and integrate it in real time with operational processes, companies and public sector organizations are evolving their enterprise architectures to incorporate new tools and approaches.

Whether you prefer "big," "very large," "extremely large," "extreme," "total," or another adjective for the "X" in the "X Data" umbrella term, what's important is accelerated growth in three dimensions: volume, complexity and speed.

Big data is not without its limitations. Many organizations need to revisit business processes, solve data silo challenges, and invest in visualization and collaboration tools to make big data understandable and actionable across an extended organization.

"Sampling is dead"

When huge data volumes can be processed and analyzed in their entirety, at scale, "sampling is dead," says Abhishek Mehta, former Bank of America (BofA) managing director and Tresata co-founder, and speaker at last year's Hadoop World. Potential applications include risk default analysis of every loan in a bank's portfolio and analysis of granular data for targeted advertising.

The BofA corporate investments group adopted a SAS high performance risk management solution together with IBM BladeCenter grid and XIV storage to power credit-risk modeling, scoring and loss forecasting. As explained in a recent call with the SAS high-performance computing team, this new enterprise risk management system reduced calculation times at BofA for forecasting the probability of loan defaults from 96 hours to four hours. In addition to speeding up loan processing and hedging decisions, Bank of America can aggregate bottom-up data from individual loans for perhaps a more accurate picture of total risk than what was possible previously by testing models on just subsets of data.
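The difference between bottom-up aggregation and extrapolating from a subset can be shown with a toy portfolio. This is a simplified sketch, not the SAS or Bank of America model: the loans are invented, and expected loss is reduced to default probability times exposure with no other factors.

```python
def expected_loss(loans: list[tuple[float, float]]) -> float:
    """Sum of default probability times exposure over every loan given."""
    return sum(probability * exposure for probability, exposure in loans)


# Hypothetical portfolio: (probability of default, exposure in dollars).
portfolio = [
    (0.01, 200_000.0),
    (0.05, 50_000.0),
    (0.20, 10_000.0),
    (0.02, 120_000.0),
]

# Bottom-up: aggregate every individual loan.
full_estimate = expected_loss(portfolio)

# Sampled: compute on a subset and scale up to the portfolio size.
sample = portfolio[:2]
sampled_estimate = expected_loss(sample) * len(portfolio) / len(sample)

print(f"bottom-up estimate: {full_estimate:,.0f}")
print(f"sampled estimate:   {sampled_estimate:,.0f}")
```

The two figures differ because the sample happens to miss the small, high-risk loan; with every loan included there is nothing to extrapolate.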

nPario holds an exclusive license from Yahoo for technology based on columnar storage that within Yahoo's internal infrastructure handles over eight petabytes of data for advertising and promotion, per a February 2011 discussion with nPario President and CEO Bassel Y. Ojjeh. nPario has basically forked the code, so that Yahoo can continue their internal use while nPario goes to market with a commercial offering for external customers. The nPario technology enables analysis at the granular level, not just at aggregate or sampled data. In addition to supporting a range of other data sources, nPario offers full integration with Adobe Omniture, including APIs that can pull data from Omniture (although Omniture charges a fee for this download).

Electronic Arts uses nPario for an "insights suite" that details how gamers engage with advertising. The nPario-powered EA analytics suite tracks clicks, impressions, demographic profiles, social media buzz and other data across EA's online, console game, mobile and social channels. The result is a much more precise understanding of consumer intent, and a greater ability to micro-target ads, than was previously possible either with sampled data or with data limited to just online or shrink-wrapped channels rather than the complete range of EA's customer engagement.

Multiple big data technologies coexist in many enterprise architectures

In many cases, organizations will use a mix-and-match combination of relational database management systems (RDBMS), Hadoop/MapReduce, R, columnar databases such as HP Vertica or ParAccel, or document-oriented databases. Also, there is growing adoption this year beyond just the financial services industry and government for complex event processing (CEP) and related real-time or near-real-time technologies to take action from web, IT, sensor and other streaming data.

At the same time that outliers are the new normal in data science, coexistence is quickly becoming the new normal for big data infrastructure and service architectures. For many enterprises and public sector organizations, the focus is "the right tool for the job" to manage structured, unstructured and semi-relational data from disparate sources. A few examples:

The Strata Online Conference, being held April 6, will look at how information — and the ability to put it to work — will shape tomorrow's markets. Scheduled speakers include: Gavin Starks from AMEE, Jeff Jonas from IBM, Chris Thorpe from Artfinder, and Ian White from Urban Mapping.

Registration is open
  • AOL Advertising integrated two data management systems: one optimized for high-throughput data analysis (the "analytics" system), the other for low-latency random access (the "transactional" system). After evaluating alternatives, AOL Advertising combined Cloudera Distribution for Apache Hadoop (CDH) with Membase (now Couchbase). This pairs Hadoop's capability for handling large, complex data volumes with Membase's sub-millisecond latency for making optimized decisions in real-time ad placement.
  • At LinkedIn, to power large-scale data computations of more than 100 billion relationships a day and low-latency site serving, they use a combination of Hadoop to process massive batch workloads, Project Voldemort as a NoSQL key/value storage engine, and the Azkaban open-source workflow system. Further, they developed a real-time, persistent messaging system named Kafka for log aggregation and activity processing.
  • The Walt Disney Co. Technology Shared Services Group extended its existing data warehouse architecture with a Hadoop cluster to provide an integration mashup for diverse departmental data, most of which is stored separately by Disney's many business units and subsidiaries. With a Hadoop cluster that went into production for shared service internal business units last October, this data can now be analyzed for patterns across different but connected customer activities, such as attendance at a theme park, purchases from Disney stores, and viewership of Disney's cable television programming. (Disney case study summarized from PricewaterhouseCoopers, Technology Forecast, Big Data issue, 2010).
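The pattern shared by these examples, a batch system that crunches the full data set and a separate store that answers fast lookups, can be sketched in a few lines. In this illustration a `Counter` stands in for the batch (Hadoop-style) job and a plain dictionary stands in for the low-latency key/value store; the function names and click data are hypothetical and do not reflect any of these companies' actual code.

```python
from collections import Counter


def batch_aggregate(click_log: list[tuple[str, str]]) -> Counter:
    """Analytics side: scan the full log and count clicks per (segment, ad)."""
    return Counter(click_log)


def publish_best_ads(counts: Counter) -> dict[str, str]:
    """Build the table the serving side reads: segment -> most-clicked ad."""
    best: dict[str, tuple[str, int]] = {}
    for (segment, ad), clicks in counts.items():
        if segment not in best or clicks > best[segment][1]:
            best[segment] = (ad, clicks)
    return {segment: ad for segment, (ad, _) in best.items()}


# Hypothetical click log of (audience segment, ad id) pairs.
click_log = [
    ("sports", "ad_1"),
    ("sports", "ad_2"),
    ("sports", "ad_2"),
    ("travel", "ad_3"),
]

# The batch job runs periodically; the serving path only does a key lookup.
serving_store = publish_best_ads(batch_aggregate(click_log))
print(serving_store.get("sports"))  # ad_2
```

The design point is the split itself: the expensive scan happens off the request path, and the request path touches only precomputed keys.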

Centralization and coexistence at eBay

Even companies whose enterprise architecture more closely aligns with the enterprise data warehouse (EDW) vision associated with Bill Inmon than the federated model popularized by Ralph Kimball are finding themselves migrating their architectures toward greater coexistence to empower business growth. eBay offers an instructive example.

"A data mart can't be cheap enough to justify its existence," says Oliver Ratzesberger, eBay's senior director of architecture and operations. eBay has migrated to a coexistence architecture featuring Teradata as the core EDW, a Teradata offshoot named Singularity for behavioral analysis and clickstream semi-relational data, and Hadoop for image processing and deep data mining. All three store multiple petabytes of data.

Named after Ray Kurzweil's thought-provoking book "The Singularity is Near," the Singularity system at eBay is running in production for managing and analyzing semi-relational data, using the same Teradata SQL user interfaces that are already widely understood and liked by many eBay staff. eBay's Hadoop instances still require separate management tools, and to date, still come with fewer capabilities for workload management than what eBay receives with its Teradata architecture.

With this tripartite architecture, eBay's consumer online marketplace has no static pages. Every page is dynamic, and many if not yet all ads are individualized. These technical innovations at eBay are helping to empower eBay's corporate resurgence, as highlighted in the March 2011 Harvard Business Review "How eBay Developed a Culture of Experimentation" interview with eBay CEO John Donahoe.

Coexistence at Bank of America

Bank of America operates a Teradata data warehouse architecture with Hadoop, R and columnar extensions along with: IBM Cognos business intelligence, InfoSphere Foundation Tools and InfoSphere DataStage; Tableau reporting; SAP global ERP reporting system; and Cisco telepresence for internal collaboration; among other technologies and systems.

R-specialist Revolution Analytics cites a Bank of America reference. In it, Mike King, a quantitative analyst at Bank of America, describes how he uses R to write programs for capital adequacy modeling, decision systems design and predictive analytics:

R allows you to take otherwise overwhelmingly complex data and view it in such a way that, all of a sudden, the choice becomes more intuitive because you can picture what it looks like. Once you have that visual image of the data in your mind, it's easier to pick the most appropriate quantitative techniques.

Revolution Analytics is sponsoring a SAS to R Challenge that encourages SAS customers to consider converting to R. Even so, coexistence between enterprise-grade software such as SAS and emerging tools such as R is a more common outcome than a replacement of, or cutback in, current or future SAS licenses, as shown by Bank of America's recent investment, described above, in the SAS risk management offering.

For its part, SAS indicates that SAS/IML Studio (formerly known as SAS Stat Studio) provides one existing capability to interface with the R language. According to Radhika Kulkarni, vice president of advanced analytics at SAS, in a discussion about SAS-R integration on the SAS website: "We are busy working on an R interface that can be surfaced in the SAS server or via other SAS clients. In the future, users will be able to interface with R through the IML procedure."

To quote Bob Rodriguez, senior director of statistical development at SAS, from that website discussion: "R is a leading language for developing new statistical methods. Our new PhD developers learned R in their graduate programs and are quite versed in it." The SAS article added that: "Both R and SAS are here to stay, and finding ways to make them work better with each other is in the best interests of our customers."

Recent evolutions in big data vendors

As 10gen CEO and co-founder Dwight Merriman and new President Max Schireson described in a call March 8: "There have been periodic rebellions against the RDBMS." Intuit's small business division uses document-oriented MongoDB from 10gen for real-time tracking of website user engagement and user activities. Document-oriented CouchDB supporter CouchOne merged with key value store and memcached specialist Membase to form Couchbase; their customers include AOL and social gaming leader Zynga.

Customers had asked DataStax (previously named Riptano) for a roadmap for integrated Cassandra and Hadoop management, per an O'Reilly Strata conference discussion with DataStax CEO and co-founder Matt Pfeil and products VP Ben Werther. In March 2011, DataStax announced the Brisk integrated Hadoop, Hive and Cassandra platform, to support high-volume, high-velocity websites and complex event processing, among other applications that require real-time or near-real-time processing. According to DataStax VP of Products Ben Werther in a March 29 email: "Cassandra is at the core of Brisk and eliminates the need for HBase because it natively provides low-latency access and everything you'd get in HBase without the complexity."

Originating at Facebook and with commercial backing from DataStax, Cassandra is in use at Cisco, Facebook, Ooyala, Rackspace/Cloudkick, SimpleGeo, Twitter and other organizations that have large, active data sets. It's basically a BigTable data model running on an Amazon Dynamo-like infrastructure. DataStax's largest Cassandra production cluster has more than 700 nodes. Cloudkick, acquired by Rackspace, offers a good discussion of the selection process that led to its use of Cassandra: 4 months with Cassandra, a love story.

While EMC/Greenplum and Teradata/Aster Data started with PostgreSQL and moved forward from there, EnterpriseDB has continued to incorporate PostgreSQL updates. EnterpriseDB CEO Ed Boyajian and VP Karen Tegan Padir explained in a call last month that while much of the initial PostgreSQL work was to build databases for sophisticated users, EnterpriseDB has done more to improve manageability and ease of use, including a 1-click installer for PostgreSQL similar to the Red Hat installer for Linux. EnterpriseDB envisions becoming for PostgreSQL what Cloudera has become for Hadoop: an integrated solution provider aimed at commercial, enterprise and public-sector accounts.

MicroStrategy is one of Cloudera's key partners for visualization and collaboration, and Informatica is quickly becoming a strong partner for ETL. To speed up what can be slow transfers in ODBC, Cloudera is building an optimized version of Sqoop. Flume agents support CEP applications, but it's not a big use case yet for Hadoop, per a call in February with Dr. Amr Awadallah, co-founder and VP of engineering, and marketing VP John Kreisa.

The following are additional examples of big data integration and coexistence efforts based on phone and in-person discussions with vendor executives in February and March 2011:

  • Adobe acquired data management platform vendor Demdex to integrate with Omniture in the Adobe Online Marketing Suite. Demdex helps advertisers shift dollars and focus from buying content-driven placements to buying specific audiences.
  • Appistry extended its CloudIQ Storage with a Hadoop edition and partnership with Accenture for a Cloud MapReduce offering for private clouds. This joint offering runs MapReduce jobs on top of the Appistry CloudIQ Platform for behind-the-firewall corporate applications.
  • Together with its siblings Cassandra and Project Voldemort, Riak is a Dynamo-inspired database that Comcast, Mozilla and others use to prototype, test and deploy applications, with commercial support and services from Basho Technologies.
  • At CloudScale, CEO Bill McColl and his team offer a platform to help developers create applications designed for real-time distributed architectures.
  • Clustrix's clustered database system looks like a MySQL database "on the wire," but without MySQL code, to combine key-value stores with relational database functionality, with a focus on online transaction processing (OLTP) applications.
  • Concurrent supports an open source abstraction for MapReduce called Cascading that allows applications to integrate with Hadoop through a Java API.
  • Within an enterprise and extending to its SaaS or social media data, Coveo offers integrated search tools for finding information quickly. For example, a Coveo user can search Microsoft SharePoint files or pull up data from all from within her Outlook email browser.
  • Germany-based Exasol added a bulk-loader and increased integration capabilities for SAP clients.
  • Based on Big Table and other Google technologies, Fusion Tables are a service for managing large collections of tabular data in the cloud. You can upload tables of up to 100MB and share them with collaborators, or make them public. You can apply filters and aggregation to your data, visualize it on maps and other charts, merge data from multiple tables, and export it to the web or csv files.
  • Yale's Daniel Abadi and several of his colleagues unveiled Hadapt to run large and ad-hoc SQL queries with high velocity on both structured and unstructured data in Hadoop, to commercialize a project that began in the Yale computer science department.
  • IBM Netezza has partnered with R specialist Revolution Analytics to add built-in R capabilities to the IBM Netezza TwinFin Data Warehouse Appliance. While Revolution Analytics has challenged SAS, they see more of a partner model with IBM Netezza and IBM SPSS. This may in part reflect the career of Revolution Analytics President and CEO Norman Nie; prior to his current role, he co-invented SPSS.
  • MapR targets speeding up Hadoop/MapReduce through a proprietary replacement for HDFS that can integrate with the rest of the Apache Hadoop ecosystem. (For a backgrounder on that ecosystem, refer to Meet the Big Data Equivalent of the LAMP Stack).
  • MarkLogic offers a purpose-built database using an XML data model for unstructured information for Simon & Schuster, Pearson Education, Boeing, the U.S. Federal Aviation Administration and other customers.
  • Microsoft Dryad offers a programming model to write parallel and distributed programs to scale from a small cluster to a large data center.
  • Pentaho offers an open source BI suite integrating capabilities for ETL, reporting, OLAP analysis, dashboards and data mining.
  • With its SpringSource and Wavemaker acquisitions, VMware is offering and expanding a suite of tools for developers to program applications that take advantage of virtualized cloud delivery environments. VMware's cloud application strategy is to empower developers to run modern applications that share information with underlying infrastructure to maximize performance, quality of service and infrastructure utilization. This extends VMware's virtualization business farther up into the software development lifecycle and provides incremental revenue for VMware while VMware positions itself for desktop virtualization to take off.

Data in the cloud

Cloud computing and big data technologies overlap. As Judith Hurwitz at Hurwitz & Associates explained in a call on February 22: "Amazon has definitely blazed the trail as the pioneer for compute services." Amazon found they had extra capacity and started renting it out, but with little or no service level guarantees.

As Judith Hurwitz discussed, the data in the cloud market is starting to bifurcate. Private clouds are advancing the enterprise shared services model with workload management, self-provisioning and other automation of shared services. IBM, Unisys, Microsoft Azure, HP, NaviSite (Time Warner) and others are offering enterprise-grade services. While data in Amazon is pretty portable -- most services link with Amazon -- many APIs and tools are still specific to one environment, or reflect important dependencies, e.g., Microsoft Azure basically assumes a .Net infrastructure.

At the 1000 Genomes Project, medical researchers are benefiting from a cloud architecture to access data for genomics research, including the ability to download a public dataset through Amazon Web Services. For medical researchers on limited budgets, using the cloud capacity for analytics can save investment dollars. However, Amazon pricing can be deceptive as CPU hours can add up to quite a lot of money over time. To speed data transfers from the cloud, the project participants are using Aspera and its fasp protocol.

The University of Washington, Monterey Bay Aquarium Research Institute and Microsoft have collaborated on Project Trident to provide a scientific workflow workbench for oceanography. Trident, implemented with Windows Workflow Foundation, .NET, Silverlight and other Microsoft technologies, allows scientists to explore and visualize oceanographic data in real-time. They can use Trident to compose, run and catalog oceanography experiments from any web browser.

Pervasive DataCloud adds a data services layer to Amazon Web Services for integration and transformation capabilities. An enterprise with multiple CRM systems can synchronize application data from Oracle/Siebel, and partner applications within a Pervasive DataCloud2 process. They can then use the feeds from that DataCloud process to power executive dashboards or business analytics. Likewise, an enterprise with data can use DataCloud2 to synch with an on-premise relational database, or synch data between and Intuit QuickBooks accounting software.

Big data jobs

All of this activity is welcome news for software engineers and other technical staff whose jobs may have been affected by overseas outsourcing. The monthly Hadoop user group meetups at the Yahoo campus now feature hundreds of attendees and even some job offers: many big data mega vendors and startups are hiring. For example, while Yahoo ended its own distribution of Hadoop, it has some interesting work underway with its Cloud Data Platform and Services including job openings there.

Cloudera counts 85 employees and continues to hire. Cloudera's Hadoop training courses are consistently sold out, including big demand from public sector organizations; the venture capital arm of the CIA, In-Q-Tel, became a Cloudera investor last month.

Recognizing big data's limits

To temper enthusiasm just a bit, 2011 is also a good time for a reality check to put big data into perspective. To benefit from big data, many enterprises and public sector organizations need to revisit business processes, solve data silo challenges, invest in visualization and collaboration tools to help make big data understandable and actionable across an extended organization, and encourage more staff to develop "T-shaped" skills that combine deep technical experience (the T's vertical line) and wide business skills (the T's horizontal line).

Also, big data applications such as risk management software will not by themselves prevent the next sub-prime mortgage meltdown or the previous generation's savings and loan industry crisis. Decision-makers at financial institutions will need to make the right risk decisions, and regulatory oversight such as the new Basel rules for minimum capital requirements may play an important role too.

For more on big data technology and business trends, including a longer discussion on big data limitations, take a look at my recently published Putting Big Data to Work: Opportunities for Enterprises report on GigaOM Pro.

With sentiment analysis, context always matters

People are finding new ways to use sentiment analysis tools to conduct business and measure market opinion. But is such analysis really effective, or is it too subjective to be relied upon?

In the following interview, Matthew Russell (@ptwobrussell), O'Reilly author and principal and co-founder of Zaffra, says the quality of sentiment analysis depends on the methodology. Large datasets, transparent methods, and remembering that context matters, he says, are key factors.

What is sentiment analysis?

Matthew Russell: Think of sentiment analysis as "opinion mining," where the objective is to classify an opinion according to a polar spectrum. The extremes on the spectrum usually correspond to positive or negative feelings about something, such as a product, brand, or person. For example, instead of taking a poll, which essentially asks a sample of a population to respond to a question by choosing a discrete option to communicate sentiment, you might write a program that mines relevant tweets or Facebook comments with the objective of scoring them according to the same criteria to try and arrive at the same result.
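To make the "write a program that scores comments" idea concrete, here is a minimal sketch in Python. It is not the method Russell describes using; it assumes a tiny hand-made word list (POSITIVE and NEGATIVE are hypothetical) and made-up sample posts, and it maps each post onto the same discrete options a poll might offer.

```python
import re

# Hypothetical, tiny word lists chosen for illustration only; a real
# system would use a much larger lexicon or a trained classifier.
POSITIVE = {"love", "great", "good", "happy", "excellent"}
NEGATIVE = {"hate", "bad", "terrible", "awful", "broken"}

def score_text(text):
    """Return a polarity score in [-1.0, 1.0]; 0.0 means neutral or unknown."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)

def classify(text):
    """Map the score onto discrete, poll-like options."""
    s = score_text(text)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"

# Made-up sample posts standing in for mined tweets or comments.
posts = [
    "I love this phone, the camera is great",
    "Terrible battery, I hate it",
    "Picked up the new model today",
]
labels = [classify(p) for p in posts]
```

A word-count approach like this ignores context entirely, which is exactly the weakness the rest of the interview discusses.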

What are the flaws with sentiment analysis? How can something like sarcasm be addressed?

Matthew Russell: Like all opinions, sentiment is inherently subjective from person to person, and can even be outright irrational. It's critical to mine a large — and relevant — sample of data when attempting to measure sentiment. No particular data point is necessarily relevant. It's the aggregate that matters.

An individual's sentiment toward a brand or product may be influenced by one or more indirect causes: someone might have a bad day and tweet a negative remark about something they otherwise had a pretty neutral opinion about. With a large enough sample, outliers are diluted in the aggregate. Also, since sentiment very likely changes over time according to a person's mood, world events, and so forth, it's usually important to look at data from the standpoint of time.
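The point about aggregates and time can be illustrated with a short sketch. It assumes each post has already been given a polarity score between -1.0 and 1.0 and a date; the function name daily_sentiment and the sample data are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def daily_sentiment(scored_posts):
    """Average per-post scores by day.

    scored_posts: iterable of (day, score) pairs, where day is an ISO date
    string and score is a polarity in [-1.0, 1.0].
    """
    by_day = defaultdict(list)
    for day, score in scored_posts:
        by_day[day].append(score)
    # Sort by day so the result reads as a time series.
    return {day: mean(scores) for day, scores in sorted(by_day.items())}

# Made-up data: nine mildly positive posts and one bad-day outlier.
sample = [("2011-03-28", 0.5)] * 9 + [("2011-03-28", -1.0)]
sample += [("2011-03-29", 0.0), ("2011-03-29", 1.0)]

trend = daily_sentiment(sample)
```

With ten posts on the first day, the single strongly negative remark lowers the daily average but does not flip its sign, which is the dilution effect described above.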

As to sarcasm, like any other type of natural language processing (NLP) analysis, context matters. Analyzing natural language data is, in my opinion, the problem of the next 2-3 decades. It's an incredibly difficult issue, and sarcasm and other types of ironic language are inherently problematic for machines to detect when looked at in isolation. It's imperative to have a sufficiently sophisticated and rigorous approach so that relevant context can be taken into account. For example, that would require knowing that a particular user is generally sarcastic, ironic, or hyperbolic, or having a larger sample of the natural language data that provides clues to determine whether or not a phrase is ironic.

Is the phrase "sentiment analysis" being used appropriately?

Matthew Russell: I've never had a problem with the phrase "sentiment analysis" except that it's a little bit imprecise in that it says nothing about how the analysis is being conducted. It only describes what is being analyzed — sentiment. Given the various flaws I've described, it's pretty clear that the analysis techniques can sometimes be as subjective as the sentiment itself. Transparency in how the analysis occurs and additional background data — such as the context of when data samples were gathered, what we know about the population that generated them, and so forth — is important. Of course, this is the case for any test involving non-trivial statistics.

Sentiment analysis recently was in the news, touted as an effective tool for predicting stock market prices. What other non-marketing applications might make use of this sort of analysis?

Matthew Russell: Stock market prices are potentially a problematic example because it's not always the case that a company that creates happy consumers is necessarily profitable. For example, key decision makers could still make poor fiscal decisions or take on bad debt. Like anything else involving sentiment, you have to hold the analysis loosely.

A couple examples, though, might include:

  • Politicians could examine the sentiment of their constituencies over time to try and gain insight into whether or not they are really representing the interests that they should be. This could possibly involve realtime analysis for a controversial topic, or historical analysis to try and identify trends such as why a "red state" is becoming a "blue state," or vice-versa. (Sentiment analysis is often looked at as a realtime activity, but mining historical samples can be incredibly relevant too.)
  • Conference organizers could use sentiment analysis based on book sales or related types of data to identify topics of interest for the schedule or keynotes.

Of course, keep in mind that just because the collective sentiment of a population might represent what the population wants, it's not necessarily the case that it's in its best interests.


March 29 2011

For publishing, traditional sales info is the tip of the data iceberg

In this month's issue of Wired, Kevin Kelly interviewed author James Gleick about his new book "The Information." At one point in the interview, Gleick talked about the perception of data on a universal scale:

Modern physics has begun to think of the bit — this binary choice — as the ultimate fundamental particle. John Wheeler summarized the idea as "it-from-bit." By that he meant that the basis of the physical universe — the "it" of an atom or subatomic particle — is not matter, nor energy, but a bit of information.

While data as the basis for the universe will likely remain the subject of scientific debate, data is rapidly proving to be the basis of successful business models. A recent GigaOM story touched on the increasing volume of data being generated in relation to the publishing landscape:

For Barnes & Noble, the data they are dealing with is exploding. It's a big, rapid change: They have 35 terabytes of data currently, but expect 20 terabytes in 2011 ... The challenge now for book sellers is to merge the dot-com website, mobile devices, and brick-and-mortar stores for a seamless experience.

Where do publishers go to gather this data, and what do they do with it once it's in-hand? I turned to Kirk Biglione, partner at Oxford Media Works for answers. In an email interview, he offered up practical ways to gather and use the sheer amount of data being generated. He also noted that traditional sales channel data, while important, is just the tip of the iceberg.

For publishers, what are the most important types of data generated?

Kirk Biglione: All forms of data are important: traditional sales data, web data, data from interactive apps, and market research data. As publishing goes digital, publishers are being inundated with new types of data. The challenge is making sense of it all and understanding how different metrics relate not only to the bottom line, but to intangibles like consumer experience.

What are the best sources to use for gathering this data?

Kirk Biglione: Traditional sales channel data is still very important, but there are quite a few new sources of data that publishers will want to consider:

Web analytic reports. These provide huge amounts of data on how a publisher's website is performing. For publishers who sell direct via their websites, web analytics provide valuable insight into critical metrics like conversion and shopping cart abandonment. Also, publishers who sell direct are in a position to collect a wealth of customer data that likely isn't available through traditional sales channels.

Search analytics — a variation on web analytics. Publishers will want to consider both on-site and off-site search analytics:

  1. Off-site: How are users finding your website? What keywords and phrases bring them to your site? Are you reaching the desired audience by ranking for the best phrases? Questions like these will likely lead smart publishers to perform a competitive search engine optimization analysis (which produces even more data).
  2. On-site: What are users searching for once they get to your website? Very few websites actually record on-site search phrases for later analysis. It's a shame because search logs reveal quite a bit about customers and their intentions.
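As a rough illustration of what recording and analyzing on-site search phrases could look like, the sketch below counts the most frequent queries in a search log. The log format (a timestamp, a tab, then the query) and the function name top_search_phrases are assumptions made for this example, not something described in the interview.

```python
from collections import Counter

def top_search_phrases(log_lines, n=3):
    """Count normalized on-site search phrases and return the n most common."""
    counts = Counter()
    for line in log_lines:
        # Assumed format: "<ISO timestamp>\t<query>"
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # skip malformed lines
        # Lowercase and collapse whitespace so variants count as one phrase.
        phrase = " ".join(parts[1].lower().split())
        if phrase:
            counts[phrase] += 1
    return counts.most_common(n)

# Made-up log entries for illustration.
log = [
    "2011-03-29T10:00:00\tPython cookbook",
    "2011-03-29T10:01:12\tpython  cookbook",
    "2011-03-29T10:02:40\tebook drm",
    "malformed line without a tab",
    "2011-03-29T10:05:03\tPython Cookbook",
]
top = top_search_phrases(log)
```

Normalizing case and spacing before counting keeps trivially different spellings of the same query from being reported as separate phrases.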

In-App analytics. How frequently are customers using your app? How long are their sessions? What time of day? Which features are they using the most? Which features are they not using at all? This is the kind of consumer usage data that is impossible to collect from print (or traditional ebooks, for that matter).

Social networks. These can provide valuable data on consumer engagement.

What are practical ways publishers can make use of this data to monetize, adapt, and market products?

Kirk Biglione: Some examples using the data sources above:

  • Search analytics can be used to optimize a publisher's website for better ranking in organic search results. That will lead to lower search marketing costs, increased discovery, and presumably more sales.
  • Web and on-site search analytics can be used to improve a website's information architecture, help consumers find what they're looking for, and eliminate barriers to completing online purchases.
  • In-app analytics can be used to develop better digital products by providing publishers with insight into which app features consumers value the most.
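For publishers who sell direct, the conversion and shopping cart abandonment metrics mentioned earlier reduce to simple ratios. The sketch below uses common textbook definitions and made-up counts; it assumes every order comes from a created cart, and the function name funnel_metrics is hypothetical.

```python
def funnel_metrics(visits, carts_created, orders):
    """Compute two common e-commerce ratios from raw counts.

    conversion rate  = orders / visits
    cart abandonment = 1 - orders / carts_created
    Assumes every order originates from a created cart.
    """
    conversion = orders / visits if visits else 0.0
    abandonment = (1 - orders / carts_created) if carts_created else 0.0
    return {"conversion_rate": conversion, "cart_abandonment": abandonment}

# Made-up monthly counts for illustration.
metrics = funnel_metrics(visits=20000, carts_created=1000, orders=250)
```

Tracking these two numbers before and after a site change is one simple way to check whether a barrier to completing purchases was actually removed.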

Top photo: Data, Information, Knowledge, Wisdom 0.1, by Michael Kreil on Flickr


March 14 2011

Four short links: 14 March 2011

  1. A History of the Future in 100 Objects (Kickstarter) -- blog+podcast+video+book project, to have future historians tell the story of our century in 100 objects. The BBC show that inspired it was brilliant, and I rather suspect this will be too. It's a clever way to tell a story of the future (his hardest problem will be creating a single coherent narrative for the 21st century). What are the 100 objects that future historians will use to sum up our century? 'Smart drugs' that change the way we think? A fragment from suitcase nuke detonated in Shanghai? A wedding ring between a human and an AI? The world's most expensive glass of water, returned from a private mission to an asteroid? (via RIG London weekly notes)
  2. Entrepreneurs Who Create Value vs Entrepreneurs Who Lock Up Value (Andy Kessler) -- distinguishes between "political entrepreneurs" who leverage their political power to own something and then overcharge or tax the crap out of the rest of us to use it vs "market entrepreneurs" who recognize the price-to-value gap and jump in. Ignoring legislation, they innovate, disintermediate, compete, stay up all night coding, and offer something better and cheaper until the market starts to shift. My attention was particularly caught by for every stroke of the pen, for every piece of legislation, for every paid-off congressman, there now exists a price umbrella that overvalues what he or any political entrepreneur is doing. (via Bryce Roberts)
  3. Harper-Collins Caps eBook Loans -- The publisher wants to sell libraries DRMed ebooks that will self-destruct after 26 loans. Public libraries have always served and continue to serve those people who can't access information on the purchase market. Jackass moves like these prevent libraries from serving those people in the future that we hope will come soon: the future where digital is default and print is premium. That premium may well be "the tentacles of soulless bottom-dwelling coprocephalic publishers can't digitally destroy your purchase". It's worth noting that O'Reilly offers DRM-free PDFs of the books they publish, including mine. Own what you buy lest it own you. (via BoingBoing and many astonished library sources)
  4. MAD Lib -- BSD-licensed open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. (via Ted Leung)
