
April 12 2012

Strata Week: Add structured data, lose local flavor?

Here are a few of the data stories that caught my attention this week:

A possible downside to Wikidata

Screenshot from the Wikidata Data Model page.

The Wikimedia Foundation — the good folks behind Wikipedia — recently proposed a Wikidata initiative. It's a new project that would build out a free secondary database to collect structured data that could provide support in turn for Wikipedia and other Wikimedia projects. According to the proposal:

"Many Wikipedia articles contain facts and connections to other articles that are not easily understood by a computer, like the population of a country or the place of birth of an actor. In Wikidata, you will be able to enter that information in a way that makes it processable by the computer. This means that the machine can provide it in different languages, use it to create overviews of such data, like lists or charts, or answer questions that can hardly be answered automatically today."

But in The Atlantic this week, Mark Graham, a research fellow at the Oxford Internet Institute, takes a look at the proposal, calling these "changes that have worrying connotations for the diversity of knowledge in the world's sixth most popular website." Graham points to the different language editions of Wikipedia, noting that the encyclopedic knowledge they contain is highly localized and diverse. "Not only does each language edition include different sets of topics, but when several editions do cover the same topic, they often put their own, unique spin on the topic. In particular, the ability of each language edition to exist independently has allowed each language community to contextualize knowledge for its audience."

Graham fears that emphasizing a standardized, machine-readable, semantic-oriented Wikipedia will lose this local flavor:

"The reason that Wikidata marks such a significant moment in Wikipedia's history is the fact that it eliminates some of the scope for culturally contingent representations of places, processes, people, and events. However, even more concerning is that fact that this sort of congealed and structured knowledge is unlikely to reflect the opinions and beliefs of traditionally marginalized groups."

His argument raises questions about the perceived universality of data: what we actually find is often deeply nuanced and localized, particularly when that data is contributed by people distributed around the globe.

The intricacies of Netflix personalization

Netflix's recommendation engine is often cited as a premier example of how user data can be mined and analyzed to build a better service. This week, Netflix's Xavier Amatriain and Justin Basilico penned a blog post offering insights into the challenges that the company — and thanks to the Netflix Prize, the data mining and machine learning communities — have faced in improving the accuracy of movie recommendation engines.

The Netflix post raises some interesting questions about how the means of content delivery have changed recommendations. In other words, when Netflix refocused on its streaming product, viewing interests changed (and not just because the selection changed). The same holds true for the multitude of ways in which we can now watch movies via Netflix (there are hundreds of different device options for accessing and viewing content from the service).

Amatriain and Basilico write:

"Now it is clear that the Netflix Prize objective, accurate prediction of a movie's rating, is just one of the many components of an effective recommendation system that optimizes our members' enjoyment. We also need to take into account factors such as context, title popularity, interest, evidence, novelty, diversity, and freshness. Supporting all the different contexts in which we want to make recommendations requires a range of algorithms that are tuned to the needs of those contexts."

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Got data news?

Feel free to email me.



January 25 2012

AI will eventually drive healthcare, but not anytime soon

TechCrunch recently published a guest post from Vinod Khosla with the headline "Do We Need Doctors or Algorithms?" Khosla is an investor and engineer, but he is a little out of his depth in some of his conclusions about health IT.

Let me concede and endorse his main point that doctors will become bionic clinicians by teaming with smart algorithms. He is also right that eventually the best doctors will be artificial intelligence (AI) systems — software minds rather than human minds.

That said, I disagree with Khosla on almost all of the details. Khosla has accidentally embraced a perspective that too many engineers and software guys bring to health IT.

Bear with me — I am the guy trying to write the "House M.D." AI algorithms that Khosla wants. It's harder than he thinks because of two main problems that he's not considering: the search space problem and the good data problem.

The search space problem

Any person even reasonably informed about AI knows about Go, an ancient game with simple rules. Those simple rules hide the fact that Go is a very complex game indeed. For a computer, it is much harder to play than chess.

Almost since the dawn of computing, chess was regarded as something that required intelligence and was therefore a good test of AI. In 1997, the world chess champion was beaten by a computer. A year later, a professional Go player beat the best Go software in the world with a 25-stone handicap. Artificial intelligence experts study Go carefully precisely because it is so hard for computers. The approach that computers take toward being smart — thinking of lots of options really fast — stops working when the number of options skyrockets, and the number of potentially right answers also becomes enormous. Most significantly, Go can always be made more computationally difficult by simply expanding the board.
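To get a feel for the scale involved, here's a back-of-the-envelope sketch. The branching factors and game lengths are rough, commonly cited approximations (the 25x25 figures are simply guesses to show the effect of a bigger board), not precise numbers:

```python
import math

def log10_tree_size(branching, depth):
    """Crude game-tree estimate: branching ** depth, returned as a power of ten."""
    return depth * math.log10(branching)

# Roughly ~35 moves over ~80 plies for chess, ~250 moves over ~150 plies
# for 19x19 Go; the 25x25 numbers are invented to illustrate board growth.
for game, branching, depth in [("chess", 35, 80), ("go 19x19", 250, 150), ("go 25x25", 450, 200)]:
    print(f"{game:9s} ~10^{log10_tree_size(branching, depth):.0f} positions in a naive search")
```

Brute force never closes a gap like that, which is exactly why the "Go instincts" point below matters.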

Make no mistake, the diagnosis and treatment of human illness is like Go. It's not like chess. Khosla is making a classic AI mistake, presuming that because he can discern the rules easily, it means the game is simple. Chess has far more complex rules than Go, but it ends up being a simpler game for computers to play.

To be great at Go, software must learn to ignore possibilities, rather than searching through them. In short, it must develop "Go instincts." The same is true for any software that could claim to be a diagnostician.

How can you tell when software diagnosticians are having search problems? When they cannot tell the difference between all of the "right" answers to a particular problem. The average doctor does not need to be told "could it be Zebra Fever?" by a computer that cannot tell that it should have ignored any zebra-related possibilities because the patient is nowhere near Africa. (No zebras were harmed in the writing of this article, and I do not believe there is a real disease called Zebra Fever.)
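A toy sketch of that kind of context filtering might look like the following. The disease names, scores, and region tags are all invented for illustration; a real differential-diagnosis engine would weigh far more evidence than location:

```python
# Toy sketch: prune differential-diagnosis candidates using context,
# instead of reporting every "possible" match. All names are invented.
candidates = [
    {"name": "Common Flu", "score": 0.70, "regions": {"global"}},
    {"name": "Zebra Fever", "score": 0.70, "regions": {"africa"}},  # fictional disease
]

def contextual_differential(candidates, patient_region):
    """Keep only candidates plausible for where the patient actually is."""
    keep = []
    for c in candidates:
        if "global" in c["regions"] or patient_region in c["regions"]:
            keep.append(c)
    return sorted(keep, key=lambda c: c["score"], reverse=True)

print([c["name"] for c in contextual_differential(candidates, "north_america")])
# -> ['Common Flu']  (Zebra Fever is ignored, as a human doctor would ignore it.)
```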

The good data problem

The second problem is the good data problem, which is what I spend most of my time working on.

Almost every time I get over-excited about the Direct Project or other health data exchange progress, my co-author David Uhlman brings me back to earth:

What good is it to have your lab results transferred from hospital A to hospital B using secure SMTP and XML? They are going to re-do the labs anyway because they don't trust the other lab.

While I still have hope for health information exchange in the long term, David is right in the short term. Healthcare data is not remotely solid or trustworthy. A good majority of the time, it is total crap. The reason that doctors insist on having labs done locally is not because they don't trust the competitor's lab; it's more of a "devil that you know" effect. They do not trust their own labs either, but they have a better understanding of how and when their own labs screw up. That is not a good environment for medical AI to blossom.

The simple reality is that doctors have good reason to be dubious about the contents of an EHR record, not least because the codes being entered there are often neither diagnostically helpful nor valid.

Non-healthcare geeks presume that the dictionaries and ontologies used to encode healthcare data are automatically valid. But in fact, the best assumption is that ontologies consistently lead to dangerous diagnostic practices, as they shepherd clinicians into choosing a label for a condition rather than a true diagnosis. Once a patient's chart has a given label, either for diagnosis or for treatment, it can be very difficult to reassess that patient effectively. There is even a name for this problem: clinical inertia. Clinical inertia is an issue with or without computer software involved, but it is very easy for an ontology of diseases and treatments to make clinical inertia worse. The fact is, medical ontologies must be constantly policed to ensure that they do not make things worse, rather than better.

It simply does not matter how good the AI algorithm is if your healthcare data is both incorrect and described with a faulty healthcare ontology. My personal experiences with health data on a wide scale? It's like having a conversation with a habitual liar who has a speech impediment.

So Khosla is not "wrong" per-se; he's just focused on solving the wrong parts of the problem. As a result, his estimations of when certain things will happen are pretty far off.

I believe that we will not have really good diagnostic software until after the singularity and until after we can ensure that healthcare data is reliable. I actually spend most of my time on the second problem, which is really a sociological problem rather than a technology problem.

Imagine if we had a "House AI" before we were able to feed it reliable data? Ironically it would be very much like the character on TV: constantly annoyed that everyone around him keeps screwing up and getting in his way.

Anyone who has seen the show knows that the House character is constantly trying to convince the other characters that the patients are lying. The reality is that the best diagnosticians typically assume that the chart is lying before they assume that the patient is lying. With notable exceptions, the typical patient is highly motivated to get a good diagnosis and is, therefore, honest. The chart, on the other hand, be it paper or digital, has no motivation whatsoever, and it will happily mix in false lab reports and record inane diagnoses from previous visits.

The average doctor doubts the patient chart but trusts the patient story. For the foreseeable future, that is going to work much better than an algorithmically focused approach.

Eventually, Khosla's version of the future (which is typical of forward-thinking geeks in health IT) will certainly happen, but I think it is still 30 years away. The technology will be ready far earlier. Our screwed up incentive systems and backward corporate politics will be holding us back. I hardly have to make this argument, however, since Hugo Campos recently made it so well.

Eventually, people will get better care from AI. For now, we should keep the algorithms focused on the data that we know is good and keep the doctors focused on the patients. We should be worried about making patient data accurate and reliable.

I promise you we will have the AI problem solved long before we have healthcare data that is reliable enough to train it.

Until that happens, imagine how Watson would have performed on "Jeopardy" if it had been trained on "Lord of the Rings" and "The Cat in the Hat" instead of encyclopedias. Until we have healthcare data that is more reliable than "The Cat in the Hat," I will keep my doctor, and you can keep your algorithms, thank you very much.

Meaningful Use and Beyond: A Guide for IT Staff in Health Care — Meaningful Use underlies a major federal incentives program for medical offices and hospitals that pays doctors and clinicians to move to electronic health records (EHR). This book is a Rosetta Stone for the IT implementer who wants to help organizations harness EHR systems.


January 05 2012

Strata Week: Unfortunately for some, Uber's dynamic pricing worked

Here are a few of the data stories that caught my attention this week.

Uber's dynamic pricing

Many passengers using the luxury car service Uber on New Year's Eve suffered sticker shock when they saw that a hefty surcharge had been added to their bills — a charge ranging from 3 to more than 6 times the regular cost of an Uber fare. Some patrons took to Twitter to complain about the pricing, and Uber responded with several blog posts and Quora answers, trying to explain the startup's use of "dynamic pricing."

The idea, writes Uber engineer Dom Anthony Narducci, is that:

... when our utilization is approaching too high of levels to continue to provide low ETA's and good dispatches, we raise prices to reduce demand and increase supply. On New Year's Eve (and just after midnight), this system worked perfectly; demand was too high, so the price bumped up. Over and over and over and over again.

In other words, in order to maintain the service that Uber is known for — reliability — the company adjusted prices based on the supply and demand for transportation. And on New Year's Eve, says Narducci, "As for how the prices got that high, at a super simplistic level, it was because things went right."
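To illustrate the idea (and only the idea — Uber's actual pricing model is not public), here's a toy sketch of a utilization-based surge multiplier. The thresholds, step size, and numbers are all made up:

```python
def surge_multiplier(requests_per_min, available_cars, base=1.0, step=0.25):
    """Toy dynamic-pricing rule (not Uber's actual algorithm):
    raise the multiplier as demand outstrips available supply."""
    if available_cars == 0:
        available_cars = 1  # avoid division by zero in this toy model
    utilization = requests_per_min / available_cars
    if utilization <= 1.0:
        return base
    # Each 25% of excess demand adds one pricing step.
    return round(base + step * ((utilization - 1.0) / 0.25), 2)

print(surge_multiplier(50, 60))   # quiet night      -> 1.0
print(surge_multiplier(300, 50))  # New Year's Eve   -> roughly 6x
```

On a quiet night demand never outstrips supply, so the multiplier stays at 1.0; on a night like New Year's Eve the same rule keeps ratcheting prices upward, which matches Narducci's "over and over and over and over again."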

TechCrunch contributor Semil Shah points to other examples of dynamic pricing, such as for airfares and hotels, and argues that we might see more of this in the future. "Starting now, consumers should also prepare to experience the underbelly of this phenomenon, a world where prices for goods and services that are in demand, either in quantity or at a certain time, aren't the same price for each of us."

But Reuters' Felix Salmon argues that this sort of algorithmic and dynamic pricing might not work well for most customers. It isn't simply that the prices for Uber car rides are high (they are always higher than a taxi anyway). He contends that the human brain really can't — or perhaps doesn't want to — handle this sort of complicated cost/benefit analysis for a decision like "should I take a cab or call Uber or just walk home." As such, he calls Uber:

... a car service for computers, who always do their sums every time they have to make a calculation. Humans don't work that way. And the way that Uber is currently priced, it's always going to find itself in a cognitive zone of discomfort as far as its passengers are concerned.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Apache Hadoop reaches v1.0

The Apache Software Foundation announced that Apache Hadoop has reached v1.0, an indication that the big data tool has achieved a certain level of stability and enterprise-readiness.

V1.0 "reflects six years of development, production experience, extensive testing, and feedback from hundreds of knowledgeable users, data scientists, and systems engineers, bringing a highly stable, enterprise-ready release of the fastest-growing big data platform," said the ASF in its announcement.

The designation by the Apache Software Foundation reaffirms the interest in and development of Hadoop, a major trend in 2011 that is likely to continue in 2012.

Proposed bill would repeal open access for federally funded research

What's the future for open data, open science, and open access in 2012? Hopefully, a bill introduced late last month isn't a harbinger of what's to come.

The Research Works Act (HR 3699) is a proposed piece of legislation that would repeal the open-access policy at the National Institutes of Health (NIH) and prohibit similar policies from being introduced at other federal agencies. HR 3699 has been referred to the Committee on Oversight and Government Reform.

The main section of the bill is quite short:

"No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any policy, program, or other activity that

  • causes, permits, or authorizes network dissemination of any private-sector research work without the prior consent of the publisher of such work; or
  • requires that any actual or prospective author, or the employer of such an actual or prospective author, assent to network dissemination of a private-sector research work."

In effect, the bill would bar the NIH and other federal agencies from requiring that the published results of the research they fund be made freely available online.

Got data news?

Feel free to email me.



November 18 2011

Four short links: 18 November 2011

  1. Learning With Quantified Self -- this CS grad student broke Jeopardy records using an app he built himself to quantify and improve his ability to answer Jeopardy questions in different categories. This is an impressive short talk and well worth watching.
  2. Evaluating Text Extraction Algorithms -- The gold standard of both datasets was produced by human annotators. 14 different algorithms were evaluated in terms of precision, recall and F1 score. The results have shown that the best open-source solution is the boilerpipe library. (via Hacker News) A quick sketch of what those three metrics measure appears after this list.
  3. Parallel Flickr -- tool for backing up your Flickr account. (Compare to one day of Flickr photos printed out)
  4. Quneo Multitouch Open Source MIDI and USB Pad (Kickstarter) -- interesting to see companies using Kickstarter to seed interest in a product. This one looks a doozie: pads, sliders, rotary sensors, with LEDs underneath and open source drivers and SDK. Looks almost sophisticated enough to drive emacs :-)
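For reference, here's what the metrics in link 2 amount to. This is a generic sketch on made-up token sets, not the evaluation harness used in the article:

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall, and F1 for one document.
    `extracted` and `gold` are sets of tokens (or sentences)."""
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {"real", "article", "text"}          # what the human annotator kept
extracted = {"real", "article", "sidebar"}  # what the algorithm kept
print(precision_recall_f1(extracted, gold)) # roughly (0.67, 0.67, 0.67)
```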

October 14 2011

Four short links: 14 October 2011

  1. Theory of Relativity in Words of Four Letters or Less -- this does just what it says, and well too. I like it, as you may too. At the end, you may even know more than you do now.
  2. Effective Set Reconciliation Without Prior Context (PDF) -- paper on using Bloom filters to do set union (deduplication) efficiently. Useful in distributed key-value stores and other big data tools. A bare-bones Bloom filter sketch follows this list. (via Matt Biddulph)
  3. Mental Notes -- each card has an insight from psychology research that's useful with web design. Shuffle the deck, peel off a card, get ideas for improving your site. (via Tom Stafford)
  4. The Internet of Things To Come (Mike Kuniavsky) -- Mike lays out the trends and technologies that will lead to an explosion in Internet of Things products. E.g., This abstraction of knowledge into silicon means that rather than starting from basic principles of electronics, designers can focus on what they're trying to create, rather than which capacitor to use or how to tell the signal from the noise. He makes it clear that, right now, we have the rich petri dish in which great networked objects can be cultured.
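To unpack link 2 a little: a Bloom filter is a compact bit array that can answer "definitely not present" or "probably present" for any key, which is what lets two nodes estimate what the other side is missing without shipping full key sets. The class below is a bare-bones illustration of the data structure itself, not the reconciliation protocol from the paper:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch: k hash positions per key, one shared bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        """False means definitely absent; True means probably present."""
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for key in ["user:1", "user:2", "user:3"]:
    bf.add(key)
print(bf.might_contain("user:2"), bf.might_contain("user:999"))  # True, almost certainly False
```

A node can send its (small) filter to a peer, and the peer can test its own keys against it to guess which ones the sender lacks — the core trick the paper builds on.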

July 18 2011

Four short links: 18 July 2011

  1. Organisational Warfare (Simon Wardley) -- notes on the commoditisation of software, with interesting analyses of the positions of some large players. On closer inspection, Salesforce seems to be doing more than just commoditisation with an ILC pattern, as can be clearly seen from the Radian6 acquisition. They also seem to be operating a tower and moat strategy, i.e. creating a tower of revenue (the service) around which is built a moat devoid of differential value with high barriers to entry. When their competitors finally wake up and realise that the future world of CRM is in this service space, they'll discover a new player dominating this space who has not only removed many of the opportunities to differentiate (e.g. social CRM, mobile CRM) but built a large ecosystem that creates high rates of new innovation. This should be a fairly fatal combination.
  2. Learning to Win by Reading Manuals in a Monte-Carlo Framework (MIT) -- starting with no prior knowledge of the game or its UI, the system learns how to play and to win by experimenting, and from parsed manual text. They used FreeCiv, and assessed the influence of parsing the manual shallowly and deeply. Trust MIT to turn RTFM into a paper. For human-readable explanation, see the press release.
  3. A Shapefile of the TZ Timezones of the World -- I have nothing but sympathy for the poor gentleman who compiled this. Political boundaries are notoriously arbitrary, and timezones are even worse because they don't need a war to change. (via Matt Biddulph)
  4. Microsoft Adventure -- 1979 Microsoft game for the TRS-80 has fascinating threads into the past and into what would become Microsoft's future.

June 22 2011

Four short links: 22 June 2011

  1. DOM Snitch -- an experimental Chrome extension that enables developers and testers to identify insecure practices commonly found in client-side code. See also the introductory post. (via Hacker News)
  2. Spark -- Hadoop-alike in Scala. Spark was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can outperform Hadoop by 30x. However, you can use Spark's convenient API for general data processing too. (via Hilary Mason) A small caching sketch follows this list.
  3. Bagel -- an implementation of the Pregel graph processing framework on Spark. (via Oliver Grisel)
  4. Week 315 (Matt Webb) -- read this entire post. It will make you smarter. The company’s decisions aren’t actually the shareholders’ decisions. A company has a culture which is not the simple sum of the opinions of the people in it. A CEO can never be said to perform an action in the way that a human body can be said to perform an action, like picking an apple. A company is a weird, complex thing, and rather than attempt (uselessly) to reduce it to people within it, it makes more sense - to me - to approach it as an alien being and attempt to understand its biology and momentums only with reference to itself. Having done that, we can then use metaphors to attempt to explain its behaviour: we can say that it follows profit, or it takes an innovative step, or that it is middle-aged, or that it treats the environment badly, or that it takes risks. None of these statements is literally true, but they can be useful to have in mind when attempting to negotiate with these bizarre, massive creatures. If anyone wonders why I link heavily to BERG's work, it's because they have some incredibly thoughtful and creative people who are focused and productive, and it's Webb's laser-like genius that makes it possible. They're doing a lot of subtle new things and it's a delight and privilege to watch them grow and reflect.
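Following up on link 2, here's a minimal sketch of the "keep the data in memory" point. It is written against the modern PySpark API (which postdates this post) and assumes pyspark and a local Spark runtime are installed; the iterative loop is deliberately trivial:

```python
# Assumes pyspark is installed and a local Spark runtime is available.
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

# A toy dataset of one million numeric values.
values = sc.parallelize(range(1_000_000)).map(float)
values.cache()  # keep the RDD in memory so each iteration rescans RAM, not disk

estimate = 0.0
for _ in range(10):              # stand-in for an iterative ML algorithm
    avg = values.mean()          # each pass reuses the cached data
    estimate += 0.5 * (avg - estimate)

print(round(estimate, 1))
sc.stop()
```

The cache() call is the whole point: without it, each pass of an iterative algorithm re-reads its input from storage, which is where the claimed 30x gap over Hadoop's MapReduce comes from.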

June 07 2011

Algorithms are the new medical tests

Predictive Medical Technologies claims that it can use real-time, intensive care unit (ICU) monitoring data to predict clinical events like cardiac arrest up to 24 hours ahead of time. Effectively, the startup's algorithms are new types of medical tests that an ICU doctor can take into consideration when deciding on a course of treatment.

Predictive Medical Technologies is based in the University of Utah's medical accelerator, which is attached to a hospital. The system will soon be tested on a trial basis with real patients and ICU physicians.

I recently talked to CEO Bryan Hughes about using data in diagnosis. Our interview follows.


What kinds of data is already available from hospital electronic medical records (EMR) and patient monitoring systems?

Bryan Hughes: We require that a hospital be at a certain technological level, in particular that the hospital has an EMR solution that is at minimum classified as Stage 4, or a Computerized Physician Order Entry system. Only about 100 hospitals in the U.S. are at this stage right now.

Once a hospital has achieved this stage, we can integrate with their computer systems and extract the raw data coming from the monitors, lab reports and even nursing notes. We can then perform real-time patient data mining and data analytics.

Our system works behind the scenes, constantly analyzing the raw patient data coming in from a variety of sources like chemistry panels, urinalysis, microbiology, respiratory and bedside monitors. We attempt to alert the doctor early to an adverse event such as cardiac arrest, or to warn that a patient might be trending toward an arrhythmia or pneumonia.
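As a generic illustration of how a retrospective risk model of this sort can be trained and used for alerting — this is not the company's algorithm, the vitals and labels below are fabricated, and it assumes scikit-learn is installed:

```python
# Generic illustration only -- NOT Predictive Medical Technologies' algorithm.
# Trains a simple risk model on made-up retrospective ICU observations
# (heart rate, systolic blood pressure, SpO2) labeled with whether an
# adverse event followed within 24 hours.
from sklearn.linear_model import LogisticRegression

X = [  # [heart_rate, systolic_bp, spo2] -- fabricated example values
    [80, 120, 98], [85, 118, 97], [90, 115, 96], [75, 125, 99],
    [120, 90, 88], [130, 85, 85], [125, 88, 86], [135, 80, 84],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = adverse event within the next 24 hours

model = LogisticRegression().fit(X, y)

# Score a new observation and alert if the estimated risk crosses a threshold.
risk = model.predict_proba([[128, 86, 87]])[0][1]
print(f"24-hour event risk: {risk:.2f}", "ALERT" if risk > 0.5 else "ok")
```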

Health IT at OSCON 2011 — The conjunction of open source and open data with health technology promises to improve creaking infrastructure and give greater control and engagement for patients. These topics will be explored in the healthcare track at OSCON (July 25-29 in Portland, Ore.)

Save 20% on registration with the code OS11RAD

How does the system integrate into an ICU doctor's existing routine?

Bryan Hughes: Depending on the technological development of a hospital, doctors either do their rounds in the ICU using a piece of paper or using a bedside computer terminal. Older systems might employ a COW (Computer On Wheels).

For hospitals that are still paper based, they have to first get to the EMR stage. It is surprising that health care, the largest and quintessential information-based industry, has failed to harness modern information exchange for so long. The oral tradition and handwritten manuscripts remain prevalent throughout most of the sector.

For hospitals that have an EMR, there are still several fundamental problems. The single most daunting problem facing modern doctors is the overwhelming amount of data. Unfortunately, especially with the growing adoption of electronic medical records, this information is disparate and not immediately available. The ability for a clinician to practice medicine is rooted in the ability to make sound decisions on reliable information.

Disparate information makes it hard to connect the dots. Massive amounts of disparate information turns the dots into a confusing sea of blobs. The dots must be connected in a manner that allows the doctor to make immediate and intelligent decisions.

We look at the current trends and progressions of disease states in the now, and then look at what may be happening in the next 24 hours. We then push this information to a mobile device such as an iPad, so the doctor sees the clinically relevant dots and can make better decisions in a timely manner.

Eventually we hope to expand to the entire hospital. But for now, the ICU is a big enough problem and a great starting point.

How do you use data to predict outcomes like cardiac arrest?

Bryan Hughes: We have two first-generation models: cardiac arrest and respiratory failure. We plan to apply our novel techniques to modeling sepsis, renal failure and re-intubation risk.

Without giving away too much of our secret sauce, we use non-hypothesis machine learning techniques, which have proven very promising so far. This approach allows us to eliminate any human "expert" bias from the models. The key then is to ensure that the data we use for development and training is clean. It is only now, with medical data available in electronic and structured form, that such clean data is becoming readily available.

What kinds of data mining techniques do you use in the product?

Bryan Hughes: We use a variety of techniques. Again, without giving too much away, our approach is to use transparent algorithms rather than a black box approach. We have a patent strategy that allows us to effectively place a white fence around our technology while allowing the academic and medical community to review our results.

How do you judge the accuracy of the algorithms?

Bryan Hughes: To date, our results have been proven using retrospective models (historical ICU monitoring data and outcomes). Our next step is to deploy our technology into a validation trial — a validation trial produces evidence that a test or treatment produces a clinical benefit. That trial is about to start at the University of Utah Medical Center in Salt Lake City.

Once the integration is completed in the next several weeks, we will be running a double-blind, prospective study with patient data. While this is only a validation trial, we are following the FDA guidance. Once the trial is up and running, we plan on expanding the validation trial to include several more hospitals. It will be at least 12 months before we start any formal FDA trial.

How is the system updated over time?

Bryan Hughes: We have developed a unique architecture that allows the system to reduce the experiment-to-validation cycle to 8 to 10 months. Typically in the medical community, a hypothesis is developed, a model is built and tested, and if it's valid, a paper is published for peer review. Once the model is accepted, it can have a life span of several years of adoption and application. That is bad because, as we know, information and knowledge change as we learn and understand more. Models need to be consistently re-evaluated and re-examined.

Are any similar systems available?

Bryan Hughes: None in the ICU, or even dealing with patient care, that we have found to date. In other industries, predictive analysis and modeling are pretty commonplace. Even your spam filter employs many of the techniques that the most sophisticated risk analysis system might use.

Photo: ekg by krzakptak, on Flickr





June 01 2011

An ethical bargain

The Readers' Forum is a small independent bookstore in Wayne, Pa., and I've been buying books from Al, the proprietor, since the mid '90s. It's ironic to me that while the one-two punch of Barnes & Noble and Amazon has been killing his business (and impoverishing him in the process), Al does with ease what every analytically minded, well-funded retailer has been trying to do: he gives me great recommendations that result in a very high marginal sales rate. I probably buy at least half of the books that he recommends.

I wander in, usually on my way up the street to grab a bite to eat, and we chat for a bit. Then he says "oh, I have something for you" and he makes his way to the counter and digs into a pile. "I remember you went to Mumbai after the attacks. You might find this interesting." And I did find it interesting, if not more than a little bit painful to read.

It looks so simple. He remembers what I buy, engages me in conversation, and sometimes in the process finds out more about me. Then he suggests things that he thinks I'll like. Or maybe he just suggests things he likes. I don't know, but it works. His recommendations aren't "like" the other things I've read in the typical clustering algorithm sense, and maybe that's exactly why they work. Anyone can suggest the next volume of Harry Potter, but his suggestions regularly stretch my notion of "the kinds of books I like."

I should clarify something. He doesn't engage me in conversation with the express purpose of feeding his algorithms, or at least I don't think he does. Over the years we have become friends. Maybe not hang out on the weekend friends, but friends in context. I look forward to stopping in for our chats and he enjoys the break. As a side benefit his algorithms get what they need and I get good reads. And that's the thing that retailers everywhere could learn from him. He isn't just trying to build a machine-embedded model of my behaviors and profit off it, he is engaging in a two-way interaction that is a pleasure in its own right. And as a result it's not the least bit creepy when he hits me with an uncannily good recommendation.

It's not like I was buying books there for years and then one day some guy pops out of the basement where I realize he's been watching me through the skylights and says "I think you'll like this book!"

"Uh, why do you think that?"

"Well, I have compiled a detailed dossier on your buying habits, including the things you picked up but didn't buy. You shouldn't put as many back down by the way. I imagine you think you have interesting taste but really, you can be a bit pedestrian in your picks. Also, sometimes my friends in other basements will let me know where you travel, how much money you make, and who you date in exchange for me telling them what you read. Anyway, based on all that I think you'll like this book, you should buy it!"

Nope, Al and I are engaging in an ethical bargain. For starters, it is completely obvious and transparent. I never imagine that I'm buying anonymously there. And I get more out of the bargain than I give up because his algorithm is incredibly data efficient. Also, because I know the nature of the deal, if I happen to need another copy of "The Anarchist's Cookbook," I can just buy it somewhere else.

Al is a sole proprietor, so when I buy from Readers' Forum I'm buying from a person. Despite the root meaning of "corporate," a corporation will never be able to be the simple embodiment of a single actor. However, I think they can still learn from Al.

The way corporate retailers do this stuff is often quite different. Too often their approach gives weight to the argument that corporations are fundamentally sociopathic. Here's one simple example: I went into a Meijer near South Bend, Ind., to buy a six-pack of beer a few months ago. At the self-checkout lane I scanned my Guinness and the machine asked me for my birthday. I was like "screw that, I'm not entering my birth date." So I canceled the transaction and got in line at a register that had a real live human cashier. She asked for ID to prove my age, and when I handed it to her she started to key in my birthday. I stopped her and asked what she was doing. "I need your birthday to prove you are 21."

"Well, just look at it. You don't need to key it into the machine to know that I'm over 21."

"Sir, I'm required to." That's the point where I muttered "bull" under my breath, asked for my ID back, and walked out.

That's a sociopath at work. "Let's use the age limit for beer as an excuse to harvest customer birth dates. We can use that data to correlate them with data we are buying from marketing services providers and then when we fill our customers' mailboxes with pulped Brazilian forests we will have a 0.3% better chance of drawing them back into the store where our HFCS-loaded end caps will grab them. If a customer complains we'll just hide behind 'the process' and let our cashier deal with the awkward moment."

Perhaps that sounds a bit harsh, but that was my reaction, and I haven't returned to one of their stores since. It's not that the sanctity of a bit of my PII was such a big deal; it was the dishonesty of an opaque and misrepresented bargain that got me so irritated. That is not the kind of thing Al would do.

OSCON Data 2011, being held July 25-27 in Portland, Ore., is a gathering for developers who are hands-on, doing the systems work and evolving architectures and tools to manage data. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD


As an aside, while people have been fascinated and disturbed by the news that Colombian drug cartels are building submarines, I'm wondering when they'll start building Hadoop clusters. Just wait till your corner dealer is entering profile data into his smart phone at the point of sale. "Sir, can I get your zip code and the last four of your social? Ah, I see you are a loyal customer. You should really join our loyalty program. I can offer you a free bag of orange kush if you sign up today."

By the way, if it seems like I might be drawing a comparison between a 100,000-square-foot HFCS pusher and a Colombian drug cartel ...

Okay ... Let me just ask this: If you are involved in data capture, analytics, or customer marketing in your company, would you be embarrassed to admit to your neighbor what about them you capture, store and analyze? Would you be willing to send them a zip file with all of it to let them see it? If the answer is "no," why not? If I might hazard a guess at the answer, it would be because real relationships aren't built on asymmetry, and you know that. But rather than eliminate that awkward source of asymmetry, you hide it.

The reason I've been thinking about all of this is because of my new job. From 2003 until January I was a defense contractor. I worked for a company that built large-scale training simulations and did command-and-control system integration. I also worked on a lot of systems strategy and planning. I know for some people that makes me suspect, but while the bureaucracy, waste, and general inanity of the federal government sometimes drove me nuts, I never felt wrong about what I was doing. But in January I took on the challenge of building a data management practice focused on Hadoop and while it has been fascinating work, it has given me the creeps more than a few times.

I think what's interesting is that you can't help but get caught up in the moment. "If we could just join this stuff with that stuff, and then get this additional attribute, we could build a really sweet model. I'm sure that would get you some prospecting lift." And then we all look at each other for a moment and go "wow, and that would be kinda creepy, too." Thankfully I'm not the only one reflecting on the ethics of all this.

Ft. Meade in Maryland is that state's single biggest consumer of electricity, and no small amount of it is being consumed by Hadoop (or similar) clusters that, as it turns out, are probably surveilling you. That is a troublesome thought, but only about half as troublesome to me as the even more thorough, broad, and pervasive corporate surveillance we are unleashing on ourselves. The only thing that keeps me sleeping is that the competitive dimension will slow the rate at which these pools of data coalesce.

Jeff Hammerbacher is concerned that the best minds of our generation are wasting their talents on advertising. I agree with him, vigorously, but to me the even bigger issue is that the kind of advertising we are doing now depends on pervasive surveillance and the reduction of who we are to mere behavioral models. In fact, I find it the height of sad irony that Jeff, and many well-meaning people like him, have become the de facto arms dealers of the surveillance state. Of course that isn't their intent, and they may not work inside the Beltway, but they are no less arms dealers than Boeing is. "Would you like a Dreamliner Hadoop cluster, or the F/A-18 kind? Never mind, they are exactly the same."

When Apple got pie on its face with that location data hullabaloo, the only real surprise as far as I was concerned was that it was a surprise at all. But that's the point, really. The fact that people were surprised speaks to the asymmetry of the bargains they are party to. The Jeff Jonases of the world certainly understand what data is being collected on all of us, but the average consumer doesn't. And why don't they? Are they stupid? Nope. Do they not care? Apparently at least some of them do care.

No, I think it's because our online relationships aren't at all like my real-world relationship with Al. The full nature of the transaction isn't obvious, visible, and transparent and there is little chance a corporation will think like my friend. Most of the relationships you build with corporations are like icebergs and essentially hidden from view, and corporations like it that way. We don't really want people asking questions about stuff we think they won't understand. As corporations we may be sociopathic, but even a sociopath knows that awkward questions aren't just uncomfortable, they're bad for business.

So, assuming there could be a more human corporation, that could build symmetric ethically-grounded relationships with you, what would that relationship look like? Would transparency and choice be enough to make it symmetric? Could a relationship with a corporation feel at all like the one I have with Al? Could it be obvious, transparent, and a pleasure in its own right? Or, what if instead of asking ourselves "what data do we need and how could we get lift from it?" we asked "what is the value to our customer when we store and use this data and how do we make both the value and our stewardship of the data obvious and transparent?"

Photo: bookstore by loranger, on Flickr





December 22 2009

140-Character Music

SuperCollider is a programming environment for real-time audio synthesis and algorithmic composition; you could also call it a tool for modern music-making. SuperCollider was written by James McCartney and is now an open source project developed by musicians, scientists, artists, and programmers.

In collaboration with the British magazine The Wire, 22 musicians around the world composed SuperCollider pieces, each consisting of no more than 140 characters of code. → Here is the track list; click a line to listen.

Here → the code pieces, to marvel at.
