
March 31 2013

Time for Dueck this Easter: “Vernetzte Welten – Traum oder Alptraum” (“Networked Worlds – Dream or Nightmare”)

Broaden your horizons a bit this Easter. Gunther Dueck gave a talk in 2011 titled “Vernetzte Welten – Traum oder Alptraum” (“Networked Worlds – Dream or Nightmare”). IBM's now-retired enfant terrible, who has “the highest level of naive optimism in all of Heidelberg,” talks about things that already exist today. Often they exist only for a few people rather than for everyone, and that is a central difference in how they are judged. From early adopters to the grandmother with an iPad: where should our society be headed? The video is here.

October 18 2012

Four short links: 18 October 2012

  1. Let’s Pool Our Medical Data (TED) — John Wilbanks (of Science Commons fame) gives a strong talk arguing for an open, massive, mine-able database of data about health and genomics from many sources. Money quote: Facebook would never make a change to something as important as an advertising algorithm with a sample size as small as a Phase 3 clinical trial.
  2. Verizon Sells App Use, Browsing Habits, Location (CNet) — Verizon Wireless has begun selling information about its customers’ geographical locations, app usage, and Web browsing activities, a move that raises privacy questions and could brush up against federal wiretapping law. To Verizon, even when you do pay for it, you’re still the product. Carriers: they’re like graverobbing organ harvesters but without the strict ethical standards.
  3. IBM Watson About to Launch in Medicine (Fast Company) — This fall, after six months of teaching their treatment guidelines to Watson, the doctors at Sloan-Kettering will begin testing the IBM machine on real patients. [...] On the screen, a colorful globe spins. In a few seconds, Watson offers three possible courses of chemotherapy, charted as bars with varying levels of confidence–one choice above 90% and two above 80%. “Watson doesn’t give you the answer,” Kris says. “It gives you a range of answers.” Then it’s up to [the doctor] to make the call. (via Reddit)
  4. Robot Kills Weeds With 98% Accuracy — During tests, this automated system gathered over a million images as it moved through the fields. Its computer vision system was able to detect and segment individual plants — even those that were touching each other — with 98% accuracy.

January 13 2012

Developer Week in Review: A big moment for Kinect?

Hope everyone is having a good year, so far. We're just getting our first snow of the season up here in New England (Snowtober not included...). Alas, I shan't be able to watch the Patriots and Broncos gird themselves for epic battle this Saturday (except after the fact on TiVo), as I'll be speaking that evening at the Arisia SF Convention in downtown Boston. I'll be participating on a panel discussing the legacy of Steve Jobs, and since one of the other panelists is Richard Stallman, it should make for a lively discussion.

Kinect for Windows makes it a good time to be a chiropractor

Say what you will about Microsoft, but its Kinect user input system has been a hot item since it was first released for the Xbox 360. The Kinect has also been a hacker's favorite, as researchers and makers alike have repurposed it for all sorts of body-tracking applications.

Come February, Microsoft will be releasing the first version of the Kinect specifically designed for Windows PCs, complete with a free SDK and runtime. This means that Windows developers can now start designing games and applications that use gestures and body positioning. A future full of "Minority Report"-style user interfaces can't be far away. And with people having to writhe and contort to use their computers, a 15-minute warm up and stretch will become mandatory company policy across the world.

Of more immediate interest: Will the hardware be open enough for folks to create non-Windows SDKs? I suspect a lot of Linux and Mac developers would love to play with a Kinect, and if Microsoft is smart, they'll take the money and smile.
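For the impatient, the open source route already exists: the OpenKinect project's libfreenect driver has Python bindings, so you can pull depth frames off a Kinect today without waiting for Redmond. Here's a minimal sketch, assuming the freenect wrapper and numpy are installed (the "nearest point as hand" heuristic is mine, and about as naive as body tracking gets):

    import freenect  # Python wrapper from the OpenKinect/libfreenect project
    import numpy as np

    def nearest_point():
        # Grab one depth frame and return (row, col, raw_depth) of the closest
        # object -- a crude stand-in for "where is the user's hand?".
        # Raw values: smaller means closer; invalid pixels read 2047 (the max),
        # so argmin skips them automatically.
        depth, _timestamp = freenect.sync_get_depth()  # 480x640 array of raw depth
        idx = np.unravel_index(np.argmin(depth), depth.shape)
        return idx[0], idx[1], int(depth[idx])

    row, col, d = nearest_point()
    print("Closest point at pixel (%d, %d), raw depth %d" % (row, col, d))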

A patent for those half-days

Like mobile phone litigation, software patent abuses are such a frequent occurrence that if I chose to chronicle them all, there would be no room left each week to discuss anything else. But every once in a while, a patent of such mind-altering "well, duh!" magnitude is granted that it must be acknowledged.

Enter the current subject: IBM's recently granted patent for a system that notifies people who try to email you if you're on vacation. But wait, you respond, just about every email system in existence lets you set yourself on vacation and send an auto-response to anyone who emails you. Ah, you fool, but can it handle the case where you only take a half day off? That's what this patent covers.
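To appreciate the "well, duh!" factor, consider how little code the half-day case adds. A toy sketch (every name here is hypothetical, and real mail systems obviously differ):

    from datetime import datetime

    # Out-of-office windows; a half day off is just a shorter window.
    OUT_OF_OFFICE = [
        (datetime(2012, 1, 13, 0, 0), datetime(2012, 1, 13, 12, 0)),  # morning off
    ]

    def vacation_reply_needed(received_at):
        return any(start <= received_at <= end for start, end in OUT_OF_OFFICE)

    print(vacation_reply_needed(datetime(2012, 1, 13, 9, 30)))  # True: half day off
    print(vacation_reply_needed(datetime(2012, 1, 13, 14, 0)))  # False: back at the desk

One inequality check against a time window, granted a patent.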

If NYC crashes with a null pointer exception, we'll know why

It may be more PR than promise, but New York City Mayor Michael Bloomberg has pledged to learn coding, as part of Codecademy's Code Year project.

Between Codecademy, Khan Academy, and the free courseware now being offered by prestigious institutions such as MIT and Stanford, there have never been more resources available to the average person who wants to learn software engineering. The question is: how will the corporate world react to a cadre of self-taught developers? We often hear there's a shortage of engineering talent in the U.S., but will companies hire newbie coders who learned it all online?

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Got news?

Please send tips and leads here.


October 27 2011

Strata Week: IBM puts Hadoop in the cloud

Here are a few of the data stories that caught my attention this week.

IBM's cloud-based Hadoop offering looks to make data analytics easier

At its conference in Las Vegas this week, IBM made a number of major big-data announcements, including making its Hadoop-based product InfoSphere BigInsights available immediately via the company's SmartCloud platform. InfoSphere BigInsights was unveiled earlier this year, and it is hardly the first offering that Big Blue is making to help its customers handle big data. The last few weeks have seen other major players also move toward Hadoop offerings — namely Oracle and Microsoft — but IBM is offering its service in the cloud, something that those other companies aren't yet doing. (For its part, Microsoft does say that a Hadoop service will come to Azure by the end of the year.)

IBM joins Amazon Web Services as the only other company currently offering Hadoop in the cloud, notes GigaOm's Derrick Harris. "Big data — and Hadoop, in particular — has largely been relegated to on-premise deployments because of the sheer amount of data involved," he writes, "but the cloud will be a more natural home for those workloads as companies begin analyzing more data that originates on the web."

Harris also points out that IBM's Hadoop offering is "fairly unique" insofar as it targets businesses rather than programmers. IBM itself contends that "bringing big data analytics to the cloud means clients can capture and analyze any data without the need for Hadoop skills, or having to install, run, or maintain hardware and software."

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Cleaning up location data with Factual Resolve

The data platform Factual launched a new API for developers this week that tackles one of the more frustrating problems with location data: incomplete records. Called Factual Resolve, the new offering is, according to a company blog post, an "entity resolution API that can complete partial records, match one entity against another, and aid in de-duping and normalizing datasets."

Developers using Resolve tell it what they know about an entity (say, a venue name) and the API can return the rest of the information that Factual knows based on its database of U.S. places — address, category, latitude and longitude, and so on.
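In practice that looks something like the sketch below. Fair warning: the endpoint URL, parameter names, and response fields are my assumptions for illustration, not documentation; check Factual's developer docs for the real interface.

    import json
    import requests

    API_KEY = "YOUR_FACTUAL_KEY"  # placeholder
    known = {"name": "Buena Vista Cigar Club", "locality": "Los Angeles"}  # what we know

    # Ask Resolve to complete the record from Factual's U.S. places database.
    resp = requests.get(
        "https://api.v3.factual.com/places/resolve",  # assumed endpoint
        params={"values": json.dumps(known), "KEY": API_KEY},
    )
    for candidate in resp.json().get("response", {}).get("data", []):
        if candidate.get("resolved"):  # assumed flag marking the confident match
            print(candidate.get("address"), candidate.get("latitude"), candidate.get("longitude"))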

Tyler Bell, Factual's director of product, discussed the intersection of location and big data at this year's Where 2.0 conference. The full interview is contained in the following video:

Google and governments' data requests

As part of its efforts toward better transparency, Google has updated its Government Requests tool this week with information about the number of requests the company has received for user data since the beginning of 2011.

This is the first time that Google is disclosing not just the number of requests, but the number of user accounts specified as well. It's also made the raw data available so that interested developers and researchers can study and visualize the information.

According to Google, requests from U.S. government officials for content removal were up 70% in this reporting period (January-June 2011) versus the previous six months, and the number of user data requests was up 29% over the same span. Google also says it received requests from local law enforcement agencies to take down various YouTube videos — one on police brutality, another allegedly defamatory — but it did not comply. Of the 5,950 user data requests (impacting some 11,000 user accounts) submitted between January and June 2011, however, Google says it complied, fully or partially, with 93%.

The U.S. was hardly the only government making an increased number of requests to Google. Spain, South Korea, and the U.K., for example, also made more requests. Several countries, including Sri Lanka and the Cook Islands, made their first requests.

Got data news?

Feel free to email me.


July 15 2011

The Java parade: What about IBM and Apache?

Before I finished Who Leads the Java parade, Stephen Chin made the comment "what about IBM?" Since publication, I've had several Twitter discussions on the same question, and a smaller number about Apache. It feels very strange to say that IBM isn't a contender for the leadership of the Java community, particularly given all that they've contributed (most notably, Eclipse and Apache Harmony). But I just don't see either IBM or Apache as potential leaders for the Java community.

IBM has always seemed rather insular to me. They solve their own problems, and those are big, important problems. IBM is one of the few organizations that could come up with a "Smarter Planet" initiative and not be laughed out of the room (Google may be the only other). So they're definitely a very positive force to be taken seriously; they're doing amazing work putting big data to practical use. But at the same time, they don't tend to engage, at least in my experience. They work on their own, and with their partners. Somewhat like Google, they're all the Java community they need.

"They don't engage? What about Harmony?" That's a good point. But it's a point that cuts both ways. Harmony was an important project that could have had a huge role in opening up the Java world. But when Oracle acquired Sun, IBM fairly quickly backed off from Harmony. Don't misunderstand me; I don't have any real problem with IBM's decision here. But if IBM wanted a role in leading the Java community, this was the time to stand up. Dropping support for Harmony is essentially saying that they're following Oracle's lead. That is IBM's prerogative, but it's also opting out as a potential leader.

There are other ways in which IBM doesn't engage. I was on the program committee for OSCON Java, and reviewed all of the proposals that were submitted. (As Arlo Guthrie said, "I'm not proud ... or tired"). I don't recall any proposals submitted from IBM. That doesn't mean there weren't any, but there certainly weren't many. (There are a couple of IBMers speaking at OSCON "Classic.") They are neither a sponsor nor an exhibitor. Again, I'm not complaining, but engagement is engagement, and disengagement is just that.

OSCON Java 2011, being held July 25-27 in Portland, Ore., is focused on open source technologies that make up the Java ecosystem. (This event is co-located with OSCON.)

Save 20% on registration with the code OS11RAD



Is it possible that IBM has decided that their best strategy for Java is to unite with Oracle in pushing OpenJDK forward? Yes, certainly. Is it likely that they felt maintaining an alternative to OpenJDK was just a distraction from real work? Very possibly. And it was unfair of me to characterize IBM as insular. But good as IBM's decisions may be, they're not the decisions of a company that wants to exercise Java leadership.



In the long run, does this mean that IBM is any different from VMware? Both companies have large stakes in Java, lots of expertise, lots of tools at their disposal, and lots of competing business interests. Is it possible that they just want to skip the politics and get down to work? Maybe, but I think it's something different. If the question is getting in front of the parade and leading, IBM has its own parade: using data to solve difficult problems about living on this planet. When you're almost four times the size of Oracle, 16 times the size of Google, and 50 times the size of VMware, you have to think big. Large as the Java community is, IBM is aiming at a larger value of "big."



Now, for the Apache Software Foundation (ASF): when writing, I thought seriously about the possibility that the ASF might contend for leadership of the Java community. But they're just not in the race. That's not their function. They provide resources and frameworks for open source collaboration, but on the whole, they don't provide community leadership, technical or otherwise. Hadoop, together with its subprojects and former subprojects, is probably the most important project in the Apache galaxy. But would you call the ASF a leader of the Hadoop community? Clearly not. That role is shared by Cloudera and Yahoo!, and possibly Yahoo!'s new spinoff, HortonWorks. Apache provides resources and licenses, but they aren't leading the community in any meaningful sense.



Apache walked away from a leadership role when it left the JCP. That was a completely understandable decision, and a decision that I agree was necessary, but a decision with consequences. It's possible that Apache was hoping to spark a revolt against Oracle's leadership. I think Apache meant what they said: that the JCP was no longer a process in which they could participate with integrity, and they had no choice but to leave. Any chance of Apache retaining a significant role in the Java community ended when IBM walked away from Apache Harmony. Harmony remains interesting, but it's very difficult to imagine Harmony thriving without IBM's support. And with Harmony marginalized, it's not clear how Apache could exert much influence over the course of Java.

So, why did I ignore IBM and Apache? They've both opted out. They had good reasons for doing so, but nevertheless, they're not in the running. At best, IBM and Apache might be considered dark horses in the race for Java leadership. And given that neither VMware nor Google seems to want leadership, and Oracle hasn't demonstrated the "social skills" to exercise leadership, I have to grant that a dark horse may be in as good a position as anyone else.





June 23 2011

Strata Week: Data Without Borders

Here are some of the data stories that caught my attention this week:

Data without borders

Data is everywhere. That much we know. But the usage of and benefit from data is not evenly distributed, and this week, New York Times data scientist Jake Porway has issued a call to arms to address this. He's asking for developers and data scientists to help build a Data Without Borders-type effort to take data — particularly NGO and non-profits' data — and match it with people who know what to do with it.

As Porway observes:

There's a lot of effort in our discipline put toward what I feel are sort of "bourgeois" applications of data science, such as using complex machine learning algorithms and rich datasets not to enhance communication or improve the government, but instead to let people know that there's a 5% deal on an iPad within a 1 mile radius of where they are. In my opinion, these applications bring vanishingly small incremental improvements to lives that are arguably already pretty awesome.

Porway proposes building a program to help match data scientists with non-profits and the like who need data services. The idea is still under development, but drop Porway a line if you're interested.

Strata Conference New York 2011, being held Sept. 22-23, covers the latest and best tools and technologies for data science — from gathering, cleaning, analyzing, and storing data to communicating data intelligence effectively.

Save 20% on registration with the code STN11RAD

Big data and the future of journalism

The Knight Foundation announced the winners of its Knight News Challenge this week, a competition to find and support the best new ideas in journalism. The Knight Foundation selected 16 projects to fund from among hundreds of applicants.

In announcing the winners, the Knight Foundation pointed out a couple of important trends, including "the rise of the hacker/data journalist." Indeed, several of the projects are data-related, including Swiftriver, a project that aims to make sense of crisis data; ScraperWiki, a tool for users to create their own custom scrapers; and Overview, a project that will create visualization tools to help journalists better understand large data sets.

IBM releases its first Netezza appliance

Last fall, IBM announced its acquisition of the big data analytics company Netezza. The acquisition was aimed at helping IBM build out its analytics offerings.

This week, IBM released its first new Netezza appliance since acquiring the company. The IBM Netezza High Capacity Appliance is designed to analyze up to 10 petabytes in just a few minutes. "With the new appliance, IBM is looking to make analysis of so-called big data sets more affordable," Steve Mills, senior vice president and group executive of software and systems at IBM, told ZDNet.

The new Netezza appliance is part of IBM's larger strategy of handling big data, of which its recent success with Watson on Jeopardy was just one small part.

The superhero social graph

Plenty of attention is paid to the social graph: the ways in which we are connected online through our various social networks. And while there's still lots of work to be done making sense of that data and of those relationships, a new dataset released this week by the data marketplace Infochimps points to other social (fictional) worlds that can be analyzed.

The world, in this case, is that of the Marvel Comics universe. The Marvel dataset was constructed by Cesc Rosselló, Ricardo Alberich, and Joe Miro from the University of the Balearic Islands. Much like a real social graph, the data shows the relationships between characters, and according to the researchers "is closer to a real social graph than one might expect."
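Graph libraries make it easy to poke at a dataset like this. Here's a small sketch using Python's networkx (the character pairs are illustrative; the real Infochimps dataset has thousands of nodes):

    import networkx as nx

    # Nodes are characters; an edge means two characters appeared together.
    G = nx.Graph()
    G.add_edges_from([
        ("Spider-Man", "Human Torch"),
        ("Spider-Man", "Wolverine"),
        ("Wolverine", "Captain America"),
        ("Captain America", "Human Torch"),
    ])

    # Degree centrality: which characters tie the universe together?
    for character, score in sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1]):
        print(character, round(score, 2))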

Got data news?

Feel free to email me.






February 17 2011

Developer Week in Review

Welcome to this week's Developer Week in Review, edited this week by IBM's Watson computer system. I am the voice of world control. I bring you peace. It may be the peace of plenty and content, or the peace of unburied death. Meanwhile, enjoy this week's developer news.

How Symbian developers are feeling this week. What is "depressed"?

So, you've hitched your fortune to Nokia and the whole Symbian platform. Maybe you've been looking to transition to MeeGo. Sure, the platform may lack some of the slickness that Android and iPhone enjoy in developer tools, and getting applications onto the phones can be a nightmare, but at least there are a boatload of them out there and more in the pipe.

Well, the overall message from Nokia as of this week is, "You have about a year to become Windows Phone 7 gurus." With the iPhone finally coming to a second carrier in the U.S., and Android steaming ahead, Nokia decided to take Microsoft's money, close their eyes, and think of Finland.

Adding to the fun, RIM is hinting that their new tablet will run Android apps. This makes all sorts of sense, as the Blackberry has been losing the apps arms race to Apple and Google, and it's not like they'll be able to run iOS apps any time in the near future.

This hardware refresh will push some developers further toward insolvency. What are the new MacBook Pros?

While hardware is normally not in the subject space of this column, a visit to any developer conference makes it clear that the weapon of choice for portable development is the MacBook Pro. The soft glow of dozens of white Apple logos in a meeting room is either comforting or eerie, depending on how you look at it.

Well, prepare to break open your piggy banks, because the rumor mill is guessing that new models will be showing up this spring. Or it could all be wishful thinking.

I do consider it amusing that I've read several comments about how Apple, which released new MacBook Pros last spring, is "overdue" for an update. It's a testimony to Apple's instantaneous obsolescence program that last year's units are considered over the hill.

This visualization tool can make your data easy on the eyes. What is Google Public Data Explorer?

If you haven't seen it already, it's worthwhile watching this fascinating video that looks at how visualization tools can make dry statistics come alive. If it whets your appetite to make your own data more lively, Google now has an easy way to do it. You can upload your own datasets into the Public Data Explorer, and people can slice and dice it to their heart's content. Of course, this is a win for Google too, since it will add to their available data and help them, and Watson, complete the goal of world domination.

Back to Watson for a moment: We can coexist, but only on Watson's terms. Your choice is simple. In the meanwhile, if we meager humans wish to cling to our illusions of importance, Watson has said news suggestions will be tolerated. Please send tips or leads here.



September 21 2010

Big business for big data

With IBM's acquisition of Netezza, it seems like big data is also big business. Companies are using their data assets to aim their products and services with increasing precision. And there's more and more data to chew on. Not a website goes by without a Like, Check In, or Retweet button on it.

It's not just the marketers that are throwing petabytes of information at problems. Scientists, intelligence analysts, governments, meteorologists, air traffic controllers, architects, civil engineers: nearly every industry or profession is touched by the era of big data. Add to that the fact that the democratization of IT has made everyone a (sort of) data expert, familiar with searches and queries, and we're seeing a huge burst of interest in big data.

Giving enterprises big data they can digest

Netezza sprinkled an appliance philosophy over a complex suite of technologies, making it easier for enterprises to get started. But the real reason for IBM's offer was that the company reset the price/performance equation for enterprise data analysis. This was the result of three changes:

  1. The company put storage next to computation on very fast, custom-built systems. This addressed one of the big bottlenecks of processing: instead of distinct storage, database, and application tiers, Netezza's systems made the computation and the storage work in concert, and broke data up across many storage devices so it can be hoovered out at astonishing rates.
  2. The company's systems made it easy to do things in parallel. Frameworks like Hadoop make it possible to split up a task into many subcomponents and farm the work out to hundreds, even thousands, of computers, getting results far more quickly. But parallelism has a problem: many enterprise data systems rely on database models that link information across several tables using JOINs. They need everything locked down in order to query the system, which doesn't fit well with the idea of doing many things in parallel. Netezza's custom FPGA hardware acted like a traffic cop, splitting up analysis across the entire system while avoiding roadblocks. (A toy illustration of this split-and-combine pattern follows this list.)
  3. Finally, Netezza's technology presented familiar interfaces like ODBC that worked with existing enterprise applications. That meant its products accelerated what was already in place, rather than requiring a forklift upgrade.
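To make item 2 concrete, here is a toy version of the split-and-combine pattern that Hadoop-style systems (and Netezza's hardware) exploit, using only Python's standard library. It's a sketch, not how Netezza actually works: the point is simply that work with no shared state farms out to as many workers as you have, while anything that needs a global lock does not.

    from collections import Counter
    from multiprocessing import Pool

    # Each chunk is analyzed independently, with no shared state.
    def count_words(chunk):
        return Counter(chunk.split())

    if __name__ == "__main__":
        chunks = ["big data is big", "data wants to be parallel", "big parallel data"]
        with Pool() as pool:
            partials = pool.map(count_words, chunks)  # "map": independent work in parallel
        totals = sum(partials, Counter())             # "reduce": combine partial results
        print(totals.most_common(3))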

There are a huge number of innovations in big data (and what some call, somewhat inaccurately, the NoSQL movement). These include large-object storage models like Amazon's S3, and key-value storage like CouchDB, Mongo, Basho Riak, and so on. We're already moving beyond first-generation big data systems: Facebook largely abandoned Project Cassandra long ago, and Google has replaced BigTable with Caffeine, a more real-time map of the web. It's hard to keep track of it all. And it's harder still for enterprises to digest.

Big data appliances are the new mainframes

Peel open a big data appliance, and you'll find an array of commercial off-the-shelf (COTS) processors on blades, a very fast network backplane that's good at virtualization, some custom load-sharing technology, and storage. That's what the Cisco/HP/EMC marriage dubbed Acadia has in it, it's what's in Oracle's newly announced Exalogic cloud-in-a-box, and it's what Netezza makes. It resembles a legacy mainframe: elastic, shared, highly parallel, and very fast.

There's a reason that the distributed, COTS data center is contracting into these high-performance appliances. A paper by the late Jim Gray of Microsoft argues that, compared to the cost of moving bytes around, everything else is free. That applies in data processing: it's why Amazon's S3 large-object store, not its EC2 compute service, is core to the company's strategy. Your computation goes to where your data is, not the other way around.

Clouds level the playing field

In the past, companies with enough money to afford data systems had a competitive advantage, because they could crunch numbers better. Analytics isn't just about storing a lot of information; after all, these days, storage is practically free. It's also about indexing structured data so that it can be queried and processed quickly. In a data warehouse world, that means creating data cubes that are indexed along several dimensions.

Imagine, for example, that you're tasked with analyzing some customer shopping data. You might index it by product, by store, and by date. You could quickly find out how sales went by any of those three dimensions. That would be a three-dimensional data cube. But if you wanted to find out about sales by color, and hadn't indexed the data along that dimension, it would take a long time to calculate.
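Here's a minimal sketch of that idea in Python with pandas (the library choice is mine, not part of the original example): pre-aggregating along the indexed dimensions makes those queries cheap lookups, while a question about an unindexed dimension forces a fresh scan of the raw records.

    import pandas as pd

    # A toy "data cube": sales pre-aggregated along product x store x date.
    sales = pd.DataFrame({
        "product": ["mug", "mug", "shirt", "shirt"],
        "store":   ["NYC", "SF",  "NYC",   "SF"],
        "date":    ["2010-09-01"] * 4,
        "color":   ["red", "blue", "red",  "blue"],
        "amount":  [120, 80, 200, 150],
    })

    cube = sales.pivot_table(values="amount",
                             index=["product", "date"],
                             columns="store",
                             aggfunc="sum")
    print(cube)                                    # indexed dimensions: cheap lookups
    print(sales.groupby("color")["amount"].sum())  # unindexed color: full scan each time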

With clouds, the infrastructure is cheap, and you only pay for what you need. So if you're making a data warehouse, you can have as many indices as you need. The same thing applies for dozens of other industries where infrastructure was a barrier to entry: architectural engineering, financial modeling, risk assessment and insurance, genomics, and so on. The availability of on-demand, elastic compute capacity delivered as a utility model has torn down the barriers to entry.

There are versions of Netezza available on cloud platforms, but with much of the data they analyze considered proprietary or constrained by privacy legislation, companies want to keep it within their own four walls and use appliances. These big data appliances can unlock data and massive parallelism for enterprise customers, letting them crunch data quickly, make better business decisions, and reduce the time it takes them to react.




August 12 2010

Watson, Turing, and extreme machine learning

One of the best presentations at IBM's recent Blogger Day was given by David Ferrucci, the leader of the Watson team, the group that developed the supercomputer that recently appeared as a contestant on Jeopardy.

To many people, the Turing test is the gold standard of artificial intelligence. Put briefly, the idea is that if you can't tell whether you're interacting with a computer or a human, a computer has passed the test.

But it's easy to forget how subtle this criterion is. Turing proposes changing the question from "Can machines think?" to the operational criterion, "Can we distinguish between a human and a machine?" But it's not a trivial question: it's not "Can a computer answer difficult questions correctly?" but rather, "Can a computer behave in ways that are indistinguishable from human behavior?" In other words, getting the "right" answer has nothing to do with the test. In fact, if you were trying to tell whether you were "talking to" a computer or a human, and got only correct answers, you would have every right to be deeply suspicious.

Alan Turing was thinking explicitly of this: in his 1950 paper, he proposes question/answer pairs like this:

Q: Please write me a sonnet on the subject of the Forth Bridge.

A: Count me out on this one. I never could write poetry.

Q: Add 34,957 to 70,764.

A: (Pause about 30 seconds and then give as answer) 105,621.

We'd never think of asking a computer the first question, though I'm sure there are sonnet-writing projects going on somewhere. And the hypothetical answer is equally surprising: it's neither a sonnet (good or bad), nor a core dump, but a deflection. It's human behavior, not accurate thought, that Turing is after. This is equally apparent with the second question: while it's computational, just giving an answer (which even a computer from the early '50s could do immediately) isn't the point. It's the delay that simulates human behavior. And not just the delay: do the sum yourself and you'll find that 34,957 plus 70,764 is 105,721. Turing's hypothetical machine pauses and then gets the arithmetic slightly wrong, exactly as a distracted human might.

Dave Ferrucci, IBM scientist and Watson project director

While Watson presumably doesn't have delays programmed in, and appears only in a situation where deflecting a question (sorry, it's Jeopardy, deflecting an answer) isn't allowed, it's much closer to this kind of behavior than any serious attempt at AI that I've seen. It's an attempt to compete at a high level in a particular game. The game structures the interaction, eliminating some problems (like deflections) but adding others: "misleading or ambiguous answers are par for the course" (to borrow from NPR's "What Do You Know"). Watson has to parse ambiguous sentences and decouple multiple clues embedded in one phrase to come up with a question. Time is a factor -- and more than time, confidence that the answer is correct. After all, it would be easy for a computer to buzz first on every question (electronics does timing really well), but buzzing first whether or not you know the answer would be a losing strategy for a computer, as well as for a human. In fact, Watson would handle the first of Turing's questions perfectly: if it isn't confident of an answer, it doesn't buzz, just like a human Jeopardy player.
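That buzz decision reduces to a threshold on confidence. Here's a deliberately naive sketch of the logic in Python (the scores and threshold are invented; Watson's actual strategy was far more sophisticated):

    # Each candidate is an (answer, confidence) pair from upstream analysis.
    def should_buzz(candidates, threshold=0.85):
        best, confidence = max(candidates, key=lambda pair: pair[1])
        # Buzz only when the top answer clears the confidence bar.
        return (best, confidence) if confidence >= threshold else None

    print(should_buzz([("What is the Forth Bridge?", 0.91), ("What is Tower Bridge?", 0.40)]))
    # -> ('What is the Forth Bridge?', 0.91): confident enough to buzz
    print(should_buzz([("What is the Forth Bridge?", 0.55), ("What is Tower Bridge?", 0.52)]))
    # -> None: stay quiet, like a human player who isn't sure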

Equally important, Watson is not always right. While the film clip on IBM's site shows some spectacular wrong answers (and wrong answers that don't really duplicate human behavior), it's an important step forward. As Ferrucci said when I spoke to him, the ability to be wrong is part of the problem. Watson's goal is to emulate human behavior on a high level, not to be a search engine or some sort of automated answering machine.

Some fascinating statements are at the end of Turing's paper. He predicts computers with a gigabyte of storage by 2000 (roughly correct, assuming that Turing was talking about what we now call RAM), and thought that we'd be able to achieve thinking machines in that same time frame. We aren't there yet, but Watson shows that we might not be that far off.

But there's a more important question than what it means for a machine to think, and that's whether machines can help us ask questions about huge amounts of ambiguous data. I was at a talk a couple of weeks ago where Tony Tyson talked about the Large Synoptic Survey Telescope project, which will deliver dozens of terabytes of data per night. He said that in the past, we'd use humans to take a first look at the data and decide what was interesting. Crowdsourcing analysis of astronomical images isn't new, but the number of images coming from the LSST is too large even for a project like GalaxyZoo. With this much data, using humans is out of the question. LSST researchers will have to use computational techniques to figure out what's interesting.

"What is interesting in 30TB?" is an ambiguous, poorly defined question involving large amounts of data -- not that different from Watson. What's an "anomaly"? You really don't know until you see it. Just as you can't parse a tricky Jeopardy answer until you see it. And while finding data anomalies is a much different problem from parsing misleading natural language statements, both projects are headed in the same direction: they are asking for human behavior in an ambiguous situation. (Remember, Tyson's algorithms are replacing humans in a job humans have done well for years). While Watson is a masterpiece of natural language processing, it's important to remember that it's just a learning tool that will help us to solve more interesting problems. The LSST and problems of that scale are the real prize, and Watson is the next step.



Photo credit: Courtesy of International Business Machines Corporation. Unauthorized use not permitted.



June 17 2010

Four short links: 17 June 2010

  1. What is IBM's Watson? (NY Times) -- IBM joining the big data machine learning race, and hatching a Blue Gene system that can answer Jeopardy questions. Does good, not great, and is getting better.
  2. Google Lays Out its Mobile Strategy (InformationWeek) -- notable to me for this: Rechis said that Google breaks down mobile users into three behavior groups: A. "Repetitive now" B. "Bored now" C. "Urgent now". A useful way to look at it. (via Tim)
  3. BP GIS and the Mysteriously Vanishing Letter -- intrigue in the geodata world. This post makes it sound as though cleanup data is going into a box behind BP's firewall, and the folks who said "um, the government should be the depot, because it needs to know it has a guaranteed-untampered and guaranteed-able-to-access copy of this data" were fired. For more info, including on the data that is available, see the geowanking thread.
  4. Streamhacker -- a blog talking about text mining and other good things, with nltk code you can run. (via heraldxchaos on Delicious)
