
April 30 2013

Leading Indicators

In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.

Our starting point was remote: Q told me about Tyler Brulé’s travel writing for Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on their club sandwiches, they’re probably cutting corners in other places.

This reminded me of when my daughter was in first grade, and we looked (briefly) at private schools. All the schools talked the same talk. But if you looked at classes, it was pretty clear that the quality of the music program was a proxy for the quality of the school. After all, it’s easy to shortchange music, and both hard and expensive to do it right. Oddly enough, using the music program as a proxy for evaluating school quality has continued to work through middle school and (public) high school. It’s the first thing to cut when the budget gets tight; and if a school has a good music program with excellent teachers, they’re probably not shortchanging the kids elsewhere.

How does this connect to data science? What are the proxies that allow you to evaluate a data science program from the “outside,” on the information that you might be able to cull from company blogs, a job interview, or even a job posting? We came up with a few ideas:

  • Are the data scientists simply human search engines, or do they have real projects that allow them to explore and be curious? If they have management support for learning what can be learned from the organization’s data, and if management listens to what they discover, they’re accomplishing something significant. If they’re just playing Q&A with the company data, finding answers to specific questions without providing any insight, they’re not really a data science group.
  • Do the data scientists live in a silo, or are they connected with the rest of the company? In Building Data Science Teams, DJ Patil wrote about the value of seating data scientists with designers, with marketers, with the entire product group, so that they don’t work in isolation and can bring their insights to bear on all aspects of the company.
  • When the data scientists do a study, is the outcome predetermined by management? Is it OK to say “we don’t have an answer” or to come up with a solution that management doesn’t like? Granted, you aren’t likely to be able to answer this question without insider information.
  • What do job postings look like? Does the company have a mission and know what it’s looking for, or are they asking for someone with a huge collection of skills, hoping that they will come in useful? That’s a sign of data science cargo culting.
  • Does management know what their tools are for, or have they just installed Hadoop because it’s what the management magazines tell them to do? Can managers talk intelligently to data scientists?
  • What sort of documentation does the group produce for its projects? Like a club sandwich, it’s easy to shortchange documentation.
  • Is the business built around the data? Or is the data science team an add-on to an existing company? A data science group can be integrated into an older company, but you have to ask a lot more questions; you have to worry a lot more about silos and management relations than you do in a company that is built around data from the start.

Coming up with these questions was an interesting thought experiment; we don’t know whether it holds water, but we suspect it does. Any ideas and opinions?

April 22 2013

A different take on data skepticism

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

Well, arguing against greater understanding of the methods we apply is like arguing against motherhood and apple pie, and Cathy and Mike are spot on in their diagnoses of the current situation. And yet …

There is so much value to be gained if we can put the power of learning, inference, and prediction methods into the hands of more developers and domain experts. But how can we avoid the pitfalls that Cathy and Mike are rightly concerned about? If a seemingly simple method like k-nearest neighbors classification is dangerous in unskilled hands (and it certainly is), then what hope is there? Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?

So, which methods are better in this regard? In general, it’s those that explore model space in addition to model parameters. In the case of k-means, for example, this would mean learning the number k in addition to the cluster assignment for each data point. For k-nearest neighbors, we could learn the number of exemplars to use and also the distance metric that provides the best explanation for the data. This multi-level approach might sound advanced, and it is true that these implementations are more complex. But complexity of implementation needn’t correlate with “danger” (thanks in part to software engineering), and it’s certainly not a sufficient reason to dismiss more robust methods.
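
As a concrete illustration of the difference, here is a minimal Python sketch (using scikit-learn, which the post doesn’t reference) that treats k as something to be learned, scoring candidate values with the silhouette coefficient rather than assuming one up front. It is plain model selection rather than the full multi-level approach described above, but it shows the spirit: the number of clusters becomes part of what the data has to justify.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # Toy data whose "true" number of clusters we pretend not to know.
    X, _ = make_blobs(n_samples=500, centers=7, random_state=0)

    # Instead of assuming k, score a range of candidates and keep the best.
    scores = {}
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)

    best_k = max(scores, key=scores.get)
    print("best k by silhouette:", best_k)

If no candidate scores well, that itself is informative: perhaps the data, as observed, really is undifferentiated and shouldn’t be split at all.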

I find the database analogy useful here: developers with only a foggy notion of database implementation routinely benefit from the expertise of the programmers who do understand these systems — i.e., the “professionals.” How? Well, decades of experience — and lots of trial and error — have yielded good abstractions in this area. As a result, we can meaningfully talk about the database “layer” in our overall “stack.” Of course, these abstractions are leaky, like all others, and there are plenty of sharp edges remaining (and, some might argue, more being created every day with the explosion of NoSQL solutions). Nevertheless, my weekend-project webapp can store and query insane amounts of data — and I have no idea how to implement a B-tree.

For ML to have a similarly broad impact, I think the tools need to follow a similar path. We need to push ourselves away from the viewpoint that sees ML methods as a bag of tricks, with the right method chosen on a per-problem basis, success requiring a good deal of art, and evaluation mainly by artificial measures of accuracy at the expense of other considerations. Trustworthiness, robustness, and conservatism are just as important, and will have far more influence on the long-run impact of ML.

Will well-intentioned people still be able to lie to themselves? Sure, of course! Let alone the greedy or malicious actors that Cathy and Mike are also concerned about. But our tools should make the common cases easy and safe, and that’s not the reality today.

April 11 2013

Data skepticism

A couple of months ago, I wrote that “big data” is heading toward the trough of a hype curve as a result of oversized hype and promises. That’s certainly true. I see more expressions of skepticism about the value of data every day. Some of the skepticism is a reaction against the hype; a lot of it arises from ignorance, and it has the same smell as the rich history of science denial from the tobacco industry (and probably much earlier) onward.

But there’s another thread of data skepticism that’s profoundly important. On her MathBabe blog, Cathy O’Neil has written several articles about lying with data — about intentionally developing models that don’t work because it’s possible to make more money from a bad model than a good one. (If you remember Mel Brooks’ classic “The Producers,” it’s the same idea.) In a slightly different vein, Cathy argues that making machine learning simple for non-experts might not be in our best interests; it’s easy to start believing answers because the computer told you so, without understanding why those answers might not correspond with reality.

I had a similar conversation with David Reiley, an economist at Google, who is working on experimental design in social sciences. Heavily paraphrasing our conversation, he said that it was all too easy to think you have plenty of data, when in fact you have the wrong data, data that’s filled with biases that lead to misleading conclusions. As Reiley points out (pdf), “the population of people who sees a particular ad may be very different from the population who does not see an ad”; yet, many data-driven studies of advertising effectiveness don’t take this bias into account. The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.

Skepticism about data is normal, and it’s a good thing. If I had to give a one line definition of science, it might be something like “organized and methodical skepticism based on evidence.” So, if we really want to do data science, it has to be done by incorporating skepticism. And here’s the key: data scientists have to own that skepticism. Data scientists have to be the biggest skeptics. Data scientists have to be skeptical about models, they have to be skeptical about overfitting, and they have to be skeptical about whether we’re asking the right questions. They have to be skeptical about how data is collected, whether that data is unbiased, and whether that data — even if there’s an inconceivably large amount of it — is sufficient to give you a meaningful result.
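
To make the overfitting point concrete, here is a small illustration of my own (using scikit-learn, which the post doesn’t mention): a high-degree polynomial looks superb on the data it was fit to and falls apart on held-out data. A skeptical data scientist reports the held-out number, not the flattering one.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 1, 40))[:, None]
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 40)

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=1)

    for degree in (3, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x_train, y_train)
        train_err = mean_squared_error(y_train, model.predict(x_train))
        test_err = mean_squared_error(y_test, model.predict(x_test))
        # The degree-15 model typically wins on training error and loses badly on held-out data.
        print(degree, round(train_err, 3), round(test_err, 3))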

Because the bottom line is: if we’re not skeptical about how we use and analyze data, who will be? That’s not a pretty thought.

April 09 2013

The re-emergence of time-series

My first job after leaving academia was as a quant (1) for a hedge fund, where I performed (what are now referred to as) data science tasks on financial time-series. I primarily used techniques from probability & statistics, econometrics, and optimization, with occasional forays into machine-learning (clustering, classification, anomalies). More recently, I’ve been closely following the emergence of tools that target large time series and decided to highlight a few interesting bits.

Time-series and big data

Over the last six months I’ve been encountering more data scientists (outside of finance) who work with massive amounts of time-series data. The rise of unstructured data has been widely reported, the growing importance of time-series much less so. Sources include data from consumer devices (gesture recognition & user interface design), sensors (apps for “self-tracking”), machines (systems in data centers), and health care. In fact some research hospitals have troves of EEG and ECG readings that translate to time-series data collections with billions (even trillions) of points.

Search and machine-learning at scale

Before doing anything else, one has to be able to run queries at scale. Last year I wrote about a team of researchers at UC Riverside who took an existing search algorithm, dynamic time-warping (2), and got it to scale to time-series with trillions of points. There are many potential applications of their research, one I highlighted is from health care:

… a doctor who needs to search through EEG data (with hundreds of billions of points), for a “prototypical epileptic spike”, where the input query is a time-series snippet with thousands of points.

As the size of the data grows, the UCR dynamic time-warping algorithm takes longer to finish (a few hours for time-series with trillions of points). In general, (academic) researchers who’ve spent weeks or months collecting data are fine waiting a few hours for a pattern recognition algorithm to finish. But users who come from different backgrounds (e.g. web companies) may not be as patient. Fortunately “search” is an active research area and faster (distributed) pattern recognition systems will likely emerge soon.
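
For readers who haven’t met dynamic time warping, here is a minimal, unoptimized Python sketch of the DTW distance itself; the UCR work adds lower bounding, early abandoning, and other optimizations (none shown here) that make it feasible at trillion-point scale.

    import numpy as np

    def dtw_distance(a, b):
        """Classic O(len(a) * len(b)) dynamic time warping distance."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = (a[i - 1] - b[j - 1]) ** 2
                cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
        return np.sqrt(cost[n, m])

    query = np.sin(np.linspace(0, 3 * np.pi, 100))            # a "prototypical spike"
    candidate = np.sin(np.linspace(0, 3 * np.pi, 120) + 0.2)  # a warped, shifted copy
    print(dtw_distance(query, candidate))

The payoff of DTW is exactly what the quadratic loop hints at: it matches shapes even when they are stretched or shifted in time, which is why making it fast matters so much.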

Once you scale up search, other interesting problems can be tackled. The UCR team is using their dynamic time-warping algorithm in tasks like classification, clustering, and motif (3) discovery. Other teams are investigating techniques from signal-processing, pattern recognition, and trajectory tracking.

Some data management tools that target time-series

One of the more popular sessions at last year’s HBase Conference was on OpenTSDB, a distributed time-series database built on top of HBase. It’s used to store and serve time-series metrics, and comes with tools (based on GNUPlot) for charting. Originally named OpenTSDB2, KairosDB was written primarily for Cassandra (but also works with HBase). OpenTSDB emphasizes tools for readying data for charts (interpolating to fill in missing values), while KairosDB distinguishes between data and the presentation of data.
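
To give a flavor of what writing to such a store looks like, here is a sketch of pushing a single data point through OpenTSDB’s HTTP /api/put endpoint. The host, port, metric name, and tags are assumptions for illustration only; adjust them to your own deployment and schema.

    import time
    import requests

    point = {
        "metric": "sys.cpu.user",       # hypothetical metric name
        "timestamp": int(time.time()),  # seconds since the epoch
        "value": 42.5,
        "tags": {"host": "web01"},      # OpenTSDB requires at least one tag
    }

    # Assumes a local OpenTSDB instance on its default port (4242).
    resp = requests.post("http://localhost:4242/api/put", json=point)
    resp.raise_for_status()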

Startup TempoDB offers a reasonably priced, cloud-based service for storing, retrieving, and visualizing time-series data. Still a work in progress, SciDB is an open source database project designed specifically for data-intensive science problems. The designers of the system plan to make time-series analysis easy to express within SciDB.


(1) I worked on trading strategies for derivatives, portfolio & risk management, and option pricing.

(2) From my earlier post: In a recent paper, the UCR team noted that “… after an exhaustive literature search of more than 800 papers, we are not aware of any distance measure that has been shown to outperform DTW by a statistically significant amount on reproducible experiments”.

(3) Motifs are similar subsequences of a long time series; shapelets are time series primitives that can be used to speed up automatic classification (by reducing the number of “features”).

This post was originally published on strata.oreilly.com.

April 05 2013

Four short links: 5 April 2013

  1. Millimetre-Accuracy 3D Imaging From 1km Away (The Register) — With further development, Heriot-Watt University Research Fellow Aongus McCarthy says, the system could end up both portable and with a range of up to 10 Km. See the paper for the full story.
  2. Robot Ants With Pheromones of Light (PLoS Comp Biol) — see also the video. (via IEEE Spectrum’s AI blog)
  3. tabula — open source tool for liberating data tables trapped inside PDF files. (via Source)
  4. There’s No Economic Imperative to Reconsider an Open Internet (SSRN) — The debate on the neutrality of Internet access isn’t new, and if its intensity varies over time, it has for a long while tainted the relationship between Internet Service Providers (ISPs) and Online Service Providers (OSPs). This paper explores the economic relationship between these two types of players, examines in laymen’s terms how the traffic can be routed efficiently and the associated cost of that routing. The paper then assesses various arguments in support of net discrimination to conclude that there is no threat to the internet economy such that reconsidering something as precious as an open internet would be necessary. (via Hamish MacEwan)

April 04 2013

Four short links: 4 April 2013

  1. geo-bootstrap — Twitter Bootstrap fork that looks like a classic geocities page. Because. (via Narciso Jaramillo)
  2. Digital Public Library of America — public libraries sharing full text and metadata for scans, coordinating digitisation, maximum reuse. See The Verge piece. (via Dan Cohen)
  3. Snake Robots — I don’t think this is a joke. The snake robot’s versatile abilities make it a useful tool for reaching locations or viewpoints that humans or other equipment cannot. The robots are able to climb to a high vantage point, maneuver through a variety of terrains, and fit through tight spaces like fences or pipes. These abilities can be useful for scouting and reconnaissance applications in either urban or natural environments. Watch the video, the nightmares will haunt you. (via Aaron Straup Cope)
  4. The Power of Data in Aboriginal Hands (PDF) — critique of government statistical data gathering of Aboriginal populations. That ABS [Australian Bureau of Statistics] survey is designed to assist governments, commentators or academics who want to construct policies that shape our lives or encourage a one-sided public discourse about us and our position in the Australian nation. The survey does not provide information that Indigenous people can use to advance our position because the data is aggregated at the national or state level or within the broad ABS categories of very remote, remote, regional or urban Australia. These categories are constructed in the imagination of the Australian nation state. They are not geographic, social or cultural spaces that have relevance to Aboriginal people. [...] The Australian nation’s foundation document of 1901 explicitly excluded Indigenous people from being counted in the national census. That provision in the constitution, combined with Section 51, sub section 26, which empowered the Commonwealth to make special laws for ‘the people of any race, other than the Aboriginal race in any State’ was an unambiguous and defining statement about Australian nation building. The Founding Fathers mandated the federated governments of Australia to oversee the disappearance of Aboriginal people in Australia.

April 03 2013

Four short links: 3 April 2013

  1. Cap’n Proto — an open source, faster take on protocol buffers (binary data interchange format and RPC system).
  2. Saddle — a high performance data manipulation library for Scala.
  3. Vega — a visualization grammar, a declarative format for creating, saving and sharing visualization designs. (via Flowing Data)
  4. dumpmon — Twitter bot that monitors paste sites for password dumps and other sensitive information. Source on github, see the announcement for more.

March 27 2013

Four short links: 27 March 2013

  1. The Effect of Group Attachment and Social Position on Prosocial Behavior (PLoSone) — notable, in my mind, for We conducted lab-in-the-field experiments involving 2,597 members of producer organizations in rural Uganda. cf the recently reported “rich are more selfish than poor” findings, which (like a lot of behavioural economics research) studies Berkeley undergrads who weren’t smart enough to figure out what was being studied.
  2. elephant — an HTTP key/value store with full-text search and fast queries. Still a work in progress.
  3. geary (IndieGoGo) — a beautiful modern open-source email client. Found this at roughly the same time as elasticinbox — an open source, reliable, distributed, scalable email store. Open source email action starting?
  4. The Faraday Copter (YouTube) — Tesla coil and quadrocopter madness. (via Jeff Jonas)

March 18 2013

Four short links: 18 March 2013

  1. A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method (PDF) — This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project. Even litcrit becoming a data game.
  2. Easy 6502 — get started writing 6502 assembly language. Fun way to get started with low-level coding.
  3. How Analytics Really Work at a Small Startup (Pete Warden) — The key for us is that we’re using the information we get primarily for decision-making (should we build out feature X?) rather than optimization (how can we improve feature X?). Nice rundown of tools and systems he uses, with plug for KissMetrics.
  4. webgl-heatmap (GitHub) — a JavaScript library for high performance heatmap display.

March 08 2013

Four short links: 8 March 2013

  1. mlcomp — a free website for objectively comparing machine learning programs across various datasets for multiple problem domains.
  2. Printing Code: Programming and the Visual Arts (Vimeo) — Rune Madsen’s talk from Heroku’s Waza. (via Andrew Odewahn)
  3. What Data Brokers Know About You (ProPublica) — excellent run-down on the compilers of big data about us. Where are they getting all this info? The stores where you shop sell it to them.
  4. Subjective Impressions Do Not Mirror Online Reading Effort: Concurrent EEG-Eyetracking Evidence from the Reading of Books and Digital Media (PLOSone) — Comprehension accuracy did not differ across the three media for either group and EEG and eye fixations were the same. Yet readers stated they preferred paper. That preference, the authors conclude, isn’t because it’s less readable. From this perspective, the subjective ratings of our participants (and those in previous studies) may be viewed as attitudes within a period of cultural change.

March 04 2013

Untangling algorithmic illusions from reality in big data

Microsoft principal researcher Kate Crawford (@katecrawford) gave a strong talk at last week’s Strata Conference in Santa Clara, Calif. about the limits of big data. She pointed out potential biases in data collection, questioned who may be excluded from it, and hammered home the constant need for context in conclusions. Video of her talk is embedded below:

Crawford explored many of these same topics in our interview, which follows.

What research are you working on now, following up on your paper on big data?

Kate Crawford: I’m currently researching how big data practices are affecting different industries, from news to crisis recovery to urban design. This talk was based on that upcoming work, touching on questions of smartphones as sensors, on dealing with disasters (like Hurricane Sandy), and new epistemologies — or ways we understand knowledge — in an era of big data.

When “Six Provocations for Big Data” came out in 2011, we were critiquing the very early stages of big data and social media. In the two years since, the issues we raised are even more prominent.

I’m now looking beyond social media to a range of other areas where big data is raising questions of social justice and privacy. I’m also editing a special issue on critiques of big data, which will be coming out later this year in the International Journal of Communications.

As more nonprofits and governments look to data analysis in governing or services, what do they need to think about and avoid?

Kate Crawford: Governments have a responsibility to serve all citizens, so it’s important that big data doesn’t become a proxy for “data about everyone.” There are two problems here: first is the question of who is visible and who isn’t represented; the second is privacy, or what I call “privacy practices” — because privacy means different things depending on where and who you are.

For example, the Streetbump app is brilliant. What city wouldn’t want to passively draw on data from all those smartphones out there, a constantly moving network of sensors? But, as we know, there are significant percentages of Americans who don’t have smartphones, particularly older citizens and those with lower disposable incomes. What happens to their neighborhoods if they generate no data? They fall off the map. To be invisible when governments make resource decisions is dangerous.

Then, of course, there’s the whole issue of people signing up to be passively tracked wherever they go. People may happily opt into it, but we’d want to be very careful about who gets that data, and how it is protected over the long term — not just five years, but 50 years and beyond. Governments might be tempted to use that data for other purposes, even civic ones, and this has significant implications for privacy and the expectations citizens have for the use of their data.

Where else could such biases apply?

Kate Crawford: There are many areas where big data bias is a problem from a social equity perspective. One of the key ones at the moment is law enforcement. I’m concerned by some of the work that seeks to “profile” areas, and even people, as likely to be involved in crime. It’s called “predictive policing” (more here). We’ve already seen some problematic outcomes when profiling was introduced for plane travel. Now, imagine what happens if you or your neighborhood falls on the wrong side of a predictive model. How do you even begin to correct the record? Which algorithm do you appeal to?

What are the things, as David Brooks listed recently, that big data can’t do?

Kate Crawford: There are lots of things that big data can’t do. It’s useful to consider the history of knowledge, and then imagine what it would look like if we only used one set of tools, one methodology for getting answers.

This is why I find people like Gabriel Tarde so interesting — he was grappling with ideas of method, big data and small data, back in the late 1800s.

He reminds us of what we can lose sight of when we go up orders of magnitude and try to leave small-scale data behind — like interviewing people, or observing communities, or running limited experiments. Context is key, and it is much easier to be attentive to context when we are surrounded by it. When context is dissolved into so many aggregated datasets, we can start getting mistaken impressions.

When Google Flu Analytics mistakenly predicted that 11% of the US had flu this year, that points to how relying on a big data signal alone may give us an exaggerated or distorted result (in that case, more than double the actual figure, which was between 4.5% and 4.8%). Now, imagine how much worse it would be if that data was all that health agencies had to work with.

I’m really interested in how we might best combine computational social science with traditional qualitative and ethnographic methods. With a range of tools and perspectives, we’re much more likely to get a three-dimensional view of a problem and be less prone to serious error. This goes beyond tacking a few focus groups onto big datasets; it means conjoining deep, ethnographically informed research with rich data sources.

What can the history of statistics in social science tell us about correlation vs causation? Does big data change that dynamic?

Kate Crawford: This is a gigantic question, and one that could be its own talk! With big datasets, it’s very tempting for researchers to engage in apophenia — seeing patterns where none actually exist — because massive quantities of data can point to a range of correlative possibilities.

For example, David Leinweber showed back in 2007 that data mining techniques could show a strong but spurious correlation between the changes in the S&P 500 stock index and butter production in Bangladesh. There’s another great correlation between the use of Facebook and the rise of the Greek debt crisis.
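
Apophenia is easy to reproduce at home. The following toy sketch (my own, in Python) generates a few hundred completely unrelated random walks and then "discovers" a strong correlation between two of them:

    import numpy as np

    rng = np.random.default_rng(7)
    # 200 independent random walks, 100 observations each: no real relationships.
    walks = rng.normal(size=(200, 100)).cumsum(axis=1)

    corr = np.corrcoef(walks)      # all pairwise correlations
    np.fill_diagonal(corr, 0)      # ignore self-correlation
    i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
    print(f"series {i} and {j} correlate at r = {corr[i, j]:.2f}")
    # With this many pairs, strong "discoveries" like this are routine, and meaningless.

With roughly 20,000 pairs to choose from, something is almost guaranteed to look impressive by chance alone.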

With big data techniques, some people argue you can get much closer to being able to predict causal relations. But even here, big data tends to need several steps of preparation (data “cleaning” and pre-processing) and several steps in interpretation (deciding which of many analyses shows a positive result versus a null-result).

Basically, humans are still in the mix, and thus it’s very hard to escape false positives, strained correlations and cognitive bias.

February 25 2013

Big data is dead, long live big data: Thoughts heading to Strata

A recent VentureBeat article argues that “Big Data” is dead. It’s been killed by marketers. That’s an understandable frustration (and a little ironic to read about it in that particular venue). As I said sarcastically the other day, “Put your Big Data in the Cloud with a Hadoop.”

You don’t have to read much industry news to get the sense that “big data” is sliding into the trough of Gartner’s hype curve. That’s natural. Regardless of the technology, the trough of the hype cycle is driven by a familiar set of causes: it’s fed by over-aggressive marketing, the longing for a silver bullet that doesn’t exist, and the desire to spout the newest buzzwords. All of these phenomena breed cynicism. Perhaps the most dangerous is the technologist who never understands the limitations of data, never understands what data isn’t telling you, or never understands that if you ask the wrong questions, you’ll certainly get the wrong answers.

Big data is not a term I’m particularly fond of. It’s just data, regardless of the size. But I do like Roger Magoulas’ definition of “big data”: big data is when the size of the data becomes part of the problem. I like that definition because it scales. It was meaningful in 1960, when “big data” was a couple of megabytes. It will be meaningful in 2030, when we all have petabyte laptops, or eyeglasses connected directly to Google’s yottabyte cloud. It’s not convenient for marketing, I admit; today’s “Big Data!!! With Hadoop And Other Essential Nutrients Added” is tomorrow’s “not so big data, small data actually.” Marketing, for better or for worse, will deal.

Whether or not Moore’s Law continues indefinitely, the real importance of the amazing increase in computing power over the last six decades isn’t that things have gotten faster; it’s that the size of the problems we can solve has gotten much, much larger. Or as Chris Gaun just wrote, big data is leading scientists to ask bigger questions. We’ve been a little too focused on Amdahl’s law, about making computing faster, and not focused enough on the reverse: how big a problem can you solve in a given time, given finite resources? Modern astronomy, physics, and genetics are all inconceivable without really big data, and I mean big on a scale that dwarfs Amazon’s inventory database. At the edges of research, data is, and always will be, part of the problem. Perhaps even the biggest part of the problem.

In the next year, we’ll slog through the cynicism that’s a natural outcome of the hype cycle. But I’m not worrying about cynicism. Data isn’t like Java, or Rails, or any of a million other technologies; data has been with us since before computers were invented, and it will still be with us when we move onto whatever comes after digital computing. Data, and specifically “big data,” will always be at the edges of research and understanding. Whether we’re mapping the brain or figuring out how the universe works, the biggest problems will almost always be the ones for which the size of the data is part of the problem. That’s an invariant. That’s why I’m excited about data.

February 21 2013

An update on in-memory data management

By Ben Lorica and Roger Magoulas

We wanted to give you a brief update on what we’ve learned so far from our series of interviews with players and practitioners in the in-memory data management space. A few preliminary themes have emerged, some expected, others surprising.

Performance improves as you put data as close to the computation as possible. We talked to people in systems, data management, web applications, and scientific computing who have embraced this concept. Some solutions go to the lowest level of hardware (L1, L2 cache). The next generation of SSDs will have latency performance closer to main memory, potentially blurring the distinction between storage and memory. For performance and power consumption reasons, we can imagine a future where the primary way systems are sized will be based on the amount of non-volatile memory* deployed.

Putting data in-memory does not negate the importance of distributed computing environments. Data size and the ability to leverage parallel environments are frequently cited reasons. The same characteristics that make distributed environments compelling also apply to in-memory systems: fault-tolerance and parallelism for performance. An additional consideration is the ability to gracefully spill over to disk when main memory is full.

There is no general purpose solution that can deliver optimal performance for all workloads. The drive for low latency requires different strategies depending on write or read intensity, fault-tolerance, and consistency. Database vendors we talked with have different approaches for transactional and analytic workloads, in some cases integrating in-memory into existing or new products. People who specialize in write-intensive systems identify hot data (i.e., frequently accessed) and put those in-memory.

Hadoop has emerged as an ingestion layer and the place to store data you might later use. The next layer identifies and extracts high-value data that can be stored in-memory for low-latency interactive queries. Because main memory is a constrained resource, using columnar stores to compress data becomes important to speed I/O and store more in a limited space.
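
As a toy illustration of why columnar layout matters (my own sketch, not something from the interviews): compressing each homogeneous column on its own typically beats compressing interleaved rows, because similar byte patterns end up adjacent.

    import zlib
    import numpy as np

    n = 1_000_000
    rng = np.random.default_rng(0)
    ids = np.arange(n, dtype=np.int64)   # sequential row ids
    prices = rng.normal(100.0, 5.0, n)   # noisy float measurements

    # Row layout: fields interleaved record by record (id, price, id, price, ...).
    rows = np.empty(n, dtype=[("id", np.int64), ("price", np.float64)])
    rows["id"], rows["price"] = ids, prices

    row_size = len(zlib.compress(rows.tobytes()))
    col_size = len(zlib.compress(ids.tobytes())) + len(zlib.compress(prices.tobytes()))
    print(row_size, col_size)  # the per-column layout usually compresses noticeably better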

While it may be difficult to make in-memory systems completely transparent, the people we talked with emphasized programming interfaces that are as simple as possible.

Our conversations to date have revealed a wide range of solutions and strategies. We remain excited about the topic, and we’re continuing our investigation. If you haven’t yet, feel free to reach out to us on Twitter (Ben is @BigData and Roger is @rogerm) or leave a comment on this post.

* By non-volatile memory we mean the next-generation SSDs. In the rest of the post “memory” refers to traditional volatile main memory.

January 29 2013

Four short links: 29 January 2013

  1. FISA Amendment Hits Non-Citizens — FISAAA essentially makes it lawful for the US to conduct purely political surveillance on foreigners’ data accessible in US Cloud providers. [...] [A] US judiciary subcommittee on FISAAA in 2008 stated that the Fourth Amendment has no relevance to non-US persons. Americans, think about how you’d feel keeping your email, CRM, accounts, and presentations on Russian or Chinese servers given the trust you have in those regimes. That’s how the rest of the world feels about American-provided services. Which jurisdiction isn’t constantly into invasive snooping, yet still has great bandwidth?
  2. Tim Berners-Lee Opposes Government Snooping — “The whole thing seems to me fraught with massive dangers and I don’t think it’s a good idea,” he said in reply to a question about the Australian government’s data retention plan.
  3. Google’s Approach to Government Requests for Information (Google Blog) — they’ve raised the dialogue about civil liberties by being so open about the requests for information they receive. Telcos and banks still regard these requests as a dirty secret that can’t be talked about, whereas Google gets headlines in NPR and CBS for it.
  4. Open Internet Tools Project — supports and incubates a collection of free and open source projects that enable anonymous, secure, reliable, and unrestricted communication on the Internet. Its goal is to enable people to talk directly to each other without being censored, surveilled or restricted.

January 18 2013

Four short links: 18 January 2013

  1. Bruce Sterling Interview — It changed my work profoundly when I realized I could talk to a global audience on the Internet, although I was legally limited from doing that by national publishing systems. The lack of any global book market has much reduced my interest in publishing books. National systems don’t “publish” me, but rather conceal me. This especially happens to writers outside the Anglophone market, but I know a lot of them, and I’ve become sensitized to their issues. It’s one of the general issues of globalization.
  2. bAdmin — database of default usernames and passwords for popular software. (via Reddit /r/netsec)
  3. Just Post It: The Lesson from Two Cases of Fabricated Data Detected by Statistics Alone (Uri Simonsohn) — I argue that requiring authors to post the raw data supporting their published results has, among many other benefits, that of making fraud much less likely to go undetected. I illustrate this point by describing two cases of fraud I identified exclusively through statistical analysis of reported means and standard deviations. Analyses of the raw data behind these provided invaluable confirmation of the initial suspicions, ruling out benign explanations (e.g., reporting errors, unusual distributions), identifying additional signs of fabrication, and also ruling out one of the suspected fraudster’s explanations for his anomalous results. (via The Atlantic)
  4. ScriptCraft — Javascript in Minecraft. Important because All The Kids play Minecraft. (via Javascript Weekly)

January 17 2013

Yelp partners with NYC and SF on restaurant inspection data

One of the key notions in my “Government as a Platform” advocacy has been that there are other ways to partner with the private sector besides hiring contractors and buying technology. One of the best of these is to provide data that can be used by the private sector to build or enrich their own citizen-facing services. Yes, the government runs a weather website but it’s more important that data from government weather satellites shows up on the Weather Channel, your local TV and radio stations, Google and Bing weather feeds, and so on. They already have more eyeballs and ears combined than the government could or should possibly acquire for its own website.

That’s why I’m so excited to see a joint effort by New York City, San Francisco, and Yelp to incorporate government health inspection data into Yelp reviews. I was involved in some early discussions and made some introductions, and have been delighted to see the project take shape.

My biggest contribution was to point to GTFS as a model. Bibiana McHugh at the city of Portland’s TriMet transit agency reached out to Google, Bing, and others with the question: “If we came up with a standard format for transit schedules, could you use it?” Google Transit was the result — a service that has spread to many other U.S. cities. When you rejoice in the convenience of getting transit timetables on your phone, remember to thank Portland officials as well as Google.

In a similar way, Yelp, New York, and San Francisco came up with a data format for health inspection data. The specification is at http://yelp.com/healthscores. It will reportedly be announced at the US Conference of Mayors with San Francisco Mayor Ed Lee today.

Code for America built a site for other municipalities to pledge support. I’d also love to see support in other local restaurant review services from companies like Foursquare, Google, Microsoft, and Yahoo!  This is, as Chris Anderson of TED likes to say, “an idea worth spreading.”

December 27 2012

Four short links: 27 December 2012

  1. Improving the Security Posture of Industrial Control Systems (NSA) — common-sense that owners of ICS should already be doing, but which (because it comes from the NSA) hopefully they’ll listen to. See also Wired article on NSA targeting domestic SCADA systems.
  2. Geographic Pricing Online (Wall Street) — Staples, Discover Financial Services, Rosetta Stone, and Home Depot offer discounts if you’re close to a competitor, higher prices otherwise. [U]sing geography as a pricing tool can also reinforce patterns that e-commerce had promised to erase: prices that are higher in areas with less competition, including rural or poor areas. It diminishes the Internet’s role as an equalizer.
  3. Hacker Scouting (NPR) — teaching kids to be safe and competent in the world of technology, just as traditional scouting teaches them to be safe and competent in the world of nature.
  4. pressureNET Data Visualization — open source barometric data-gathering software which runs on Android devices. Source is on GitHub.

December 14 2012

LISA mixes the ancient and modern: report from USENIX system administration conference

I came to LISA, the classic USENIX conference, to find out this year who was using such advanced techniques as cloud computing, continuous integration, non-relational databases, and IPv6. I found lots of evidence of those technologies in action, but also had the bracing experience of getting stuck in a talk with dozens of Solaris fans.

Such is the confluence of old and new at LISA. I also heard of the continued relevance of magnetic tape–its storage costs are orders of magnitude below those of disks–and of new developments in NFS. Think of NFS as a protocol, not a filesystem: it can now connect many different filesystems, including the favorites of modern distributed system users.

LISA, and the USENIX organization that valiantly unveils it each year, are communities at least as resilient as the systems that their adherents spend their lives keeping humming. Familiar speakers return each year. Members crowd a conference room in the evening to pepper the staff with questions about organizational issues. Attendees exchange their t-shirts for tuxes to attend a three-hour reception aboard a boat on the San Diego harbor, which this time was experiencing unseasonably brisk weather. (Full disclosure: I skipped the reception and wrote this article instead.) Let no one claim that computer administrators are anti-social.

Again in the spirit of full disclosure, let me admit that I perform several key operations on a Solaris system. When it goes away (which someday it will), I’ll have to alter some workflows.

The continued resilience of LISA

Conferences, like books, have a hard go of it in the age of instant online information. I wasn’t around in the days when people would attend conferences to exchange magnetic tapes with their free software, but I remember the days when companies would plan their releases to occur on the first day of a conference and would make major announcements there. The tradition of using conferences to propel technical innovation is not dead; for instance, OpenStack was announced at an O’Reilly Open Source convention.

But as pointed out by Thomas Limoncelli, an O’Reilly author (Time Management for System Administrators) and a very popular LISA speaker, the Internet has altered the equation for product announcements in two profound ways. First of all, companies and open source projects can achieve notoriety in other ways without leveraging conferences. Second, and more subtly, the philosophy of “release early, release often” launches new features multiple times a year and reduces the impact of major versions. The conferences need a different justification.

Limoncelli says that LISA has survived by getting known as the place you can get training that you can get nowhere else. “You can learn about a tool from the person who created the tool,” he says. Indeed, at the BOFs it was impressive to hear the creator of a major open source tool reveal his plans for a major overhaul that would permit plugin modules. It was sobering though to hear him complain about a lack of funds to do the job, and discuss with the audience some options for getting financial support.

LISA is not only a conference for the recognized stars of computing, but a place to show off students who can create a complete user administration interface in their spare time, or design a generalized extension of common Unix tools (grep, diff, and so forth) that work on structured blocks of text instead of individual lines.

Another long-time attendee told me that companies don’t expect anyone here to whip out a checkbook in the exhibition hall, but they still come. They have a valuable chance at LISA to talk to people who don’t have direct purchasing authority but possess the technical expertise to explain to their bosses the importance of new products. LISA is also a place where people can delve as deep as they please into technical discussions of products.

I noticed good attendance at vendor-sponsored Bird-of-a-Feather sessions, even those lacking beer. For instance, two Ceph staff signed up for a BOF at 10 in the evening, and were surprised to see over 30 attendees. It was in my mind a perfect BOF. The audience talked more than the speakers, and the speakers asked questions as well as delivering answers.

But many BOFs didn’t fit the casual format I used to know. Often, the leader turned up with a full set of slides and took up a full hour going through a list of new features. There were still audience comments, but no more than at a conference session.

Memorable keynotes

One undeniable highlight of LISA was the keynote by Internet pioneer Vint Cerf. After years in Washington, DC, Cerf took visible pleasure in geeking out with people who could understand the technical implications of the movements he likes to track. His talk ranged from the depth of his wine cellar (which he is gradually outfitting with sensors for quality and security) to interplanetary travel.

The early part of his talk danced over general topics that I think were already adequately understood by his audience, such as the value of DNSSEC. But he often raised useful issues for further consideration, such as who will manage the billions of devices that will be attached to the Internet over the next few years. It can be useful to delegate read access and even write access (to change device state) to a third party when the device owner is unavailable. In trying to imagine a model for sets of devices, Cerf suggested the familiar Internet concept of an autonomous system, which obviously has scaled well and allowed us to distinguish routers running different protocols.

The smart grid (for electricity) is another concern of Cerf’s. While he acknowledged known issues of security and privacy, he suggested that the biggest problem will be the classic problem of coordinated distributed systems. In an environment where individual homes come and go off the grid, adding energy to it along with removing energy, it will be hard to predict what people need and produce just the right amount at any time. One strategy involves microgrids: letting neighborhoods manage their own energy needs to avoid letting failures cascade through a large geographic area.

Cerf did not omit to warn us of the current stumbling efforts in the UN to institute more governance for the Internet. He acknowledged that abuse of the Internet is a problem, but said the ITU needs an “excuse to continue” as radio, TV, etc. migrate to the Internet and the ITU’s standards see decreasing relevance.

Cerf also touted the Digital Vellum project for the preservation of data and software. He suggested that we need a legal framework that would require software developers to provide enough information for people to continue getting access to their own documents as old formats and software are replaced. “If we don’t do this,” he warned, “our 22nd-century descendants won’t know much about us.”

Talking about OpenFlow and Software Defined Networking, he found its most exciting opportunity to be letting us use content to direct network traffic in addition to, or instead of, addresses.

Another fine keynote was delivered by Matt Blaze on a project he and colleagues conducted to assess the security of the P25 mobile systems used everywhere by security forces, including local police and fire departments, soldiers in the field, FBI and CIA staff conducting surveillance, and executive bodyguards. Ironically, there are so many problems with these communication systems that the talk was disappointing.

I should in no way diminish the intelligence and care invested by these researchers from the University of Pennsylvania. It’s just that the history of P25 makes security lapses seem inevitable. Because it was old, and was designed to accommodate devices that were even older, it failed to implement basic technologies such as asymmetric encryption that we now take for granted. Furthermore, most of the users of these devices are more concerned with getting messages to their intended destinations (so that personnel can respond to an emergency) than with preventing potential enemies from gaining access. Putting all this together, instead of saying “What impressive research,” we tend to say, “What else would you expect?”

Random insights

Attendees certainly had their choice of virtualization and cloud solutions at the conference. A very basic introduction to OpenStack was offered, along with another by developers of CloudStack. Although the latter is older and more settled, it is losing the battle for mindshare. One developer explained that CloudStack has a smaller scope than OpenStack, because CloudStack is focused on high-computing environments. However, he claimed, CloudStack works on really huge deployments where he hasn’t seen other successful solutions. Yet another open source virtualization platform presented was Google’s Ganeti.

I also attended talks and had chats with developers working on the latest generation of data stores: massive distributed file systems like Hadoop’s HDFS, and high-performance tools such as HBase and Impala, for accessing the data it stores. There seems to be an accordion effect in data stores: developers start with simple flat or key-value structures. Then they find the need over time–depending on their particular applications–for more hierarchy or delimited data, and either make their data stores more heavyweight or jerry-rig the structure through conventions such as defining fields for certain purposes. Finally we’re back at something mimicking the features of a relational database, and someone rebels and starts another bare-bones project.

One such developer told me he hoped his project never turns into a behemoth like CORBA or (lamentably) what WS-* specifications seem to have wrought.

CORBA is universally recognized as dead–perhaps stillborn, because I never heard of major systems deployed in production. In fact, I never knew of an implementation that caught up with the constant new layers of complexity thrown on by the standards committee.

In contrast, WS-* specifications teeter on the edge of acceptability, as a number of organizations swear by them.

I pointed out to my colleague that most modern cloud or PC systems are unlikely to suffer from the weight of CORBA or WS-*, because the latter two systems were created for environments without trust. They were meant to tie together organizations with conflicting goals, and were designed by consortia of large vendors jockeying for market share. For both of these reasons, they have to negotiate all sorts of parameters and add many assurances to every communication.

Recently we’ve seen an increase of interest in functional programming. It occurred to me this week that many aspects of functional programming go nicely with virtualization and the cloud. When you write code with no side effects and no global state, you can recover more easily when instances of your servers disappear. It’s fascinating to see how technologies coming from many different places push each other forward–and sometimes hold each other back.

December 12 2012

Four short links: 12 December 2012

  1. Kiwi Bond Films Are The Most Violent (Peter Griffin) — it wasn’t always furry-footed plucky adventurers in Middle Earth, my friends. Included to show that you can take an evidence-based approach to almost any argument.
  2. Are Githubbers Taking Open Source Seriously? — nearly 140 of the 175 projects analyzed contain such an easily findable license information, or more precisely 78%. Or, alternatively 22% of Github projects don’t have easily findable license information. zomg. (via Simon Phipps)
  3. The Oh Shit (Matt Jones) — the condition of best-laid plans meeting reality. When all the drawings, sections, detailed drawings and meticulous sourcing in the world clash with odd corners of the physical world, weather, materials and not least the vagaries of human labour. It’s what Bryan Boyer calls the “Matter Battle”. He puts it beautifully: “One enters a Matter Battle when there is an attempt to execute the desires of the mind in any medium of physical matter.”
  4. Text Messages Direct to your Contact Lens (The Telegraph) — I want this so bad. It’s a future I can believe in. Of course, the free ones will have spam.