
August 03 2013

TANAGRA - Data mining, statistics, and data analysis software for teaching and research

http://eric.univ-lyon2.fr/~ricco/tanagra/fr/tanagra.html #data #datamining

May 24 2012

Four short links: 24 May 2012

  1. Last Saturday My Son Found His People at the Maker Faire -- aww to the power of INFINITY.
  2. Dictionaries Linking Words to Concepts (Google Research) -- Wikipedia entries for concepts, text strings from searches and the oppressed workers down the Text Mines, and a count indicating how often the two were related.
  3. Magic Wand (Kickstarter) -- I don't want the game, I want a Bluetooth magic wand. I don't want to click the OK button, I want to wave a wand and make it so! (via Pete Warden)
  4. E-Commerce Performance (Luke Wroblewski) -- If a page load takes more than two seconds, 40% are likely to abandon that site. This is why you should follow Steve Souders like a hawk: if your site is slower than it could be, you're leaving money on the table.

February 08 2012

Four short links: 8 February 2012

  1. Mavuno -- an open source, modular, scalable text mining toolkit built upon Hadoop. (Apache-licensed)
  2. Cow Clicker -- Wired profile of Cowclicker creator Ian Bogost. I was impressed by Cow Clickers [...] have turned what was intended to be a vapid experience into a source of camaraderie and creativity. People create communities around social activities, even when they are antisocial. (via BoingBoing)
  3. Unicode Has a Pile of Poo Character (BoingBoing) -- this is perfect.
  4. The Research Works Act and the Breakdown of Mutual Incomprehension (Cameron Neylon) -- an excellent summary of how researchers and publishers view each other and their place in the world.

February 07 2012

Unstructured data is worth the effort when you've got the right tools


It's dawning on companies that data analysis can yield insights and inform business decisions. As data-driven benefits grow, so do our demands about what more data can tell us and what other types we can mine.

During her PhD studies, Alyona Medelyan (@zelandiya) developed Maui, an open source tool that performs as well as professional librarians in identifying main topics in documents. Medelyan now leads the research and development of API-based products at Pingar.

Pingar senior software researcher Anna Divoli (@annadivoli) studied sentence extraction for semi-automatic annotation of biological databases. Her current research focuses on developing methodologies for acquiring knowledge from textual data.

"Big data is important in many diverse areas, such as science, social media, and enterprise," observes Divoli. "Our big data niche is analysis of unstructured text." In the interview below, Medelyan and Divoli describe their work and what they see on the horizon for unstructured data analysis.

How did you get started in big data?

Anna Divoli: I began working with big data as it relates to science during my PhD. I worked with bioinformaticians who mined proteomics data. My research was on mining information from the biomedical literature that could serve as annotation in a database of protein families.

Alyona Medelyan: Like Anna, I mainly focus on unstructured data and how it can be managed using clever algorithms. During my PhD in natural language processing and data mining, I started applying such algorithms to large datasets to investigate how time-consuming data analysis and processing tasks can be automated.

What projects are you working on now?

Alyona Medelyan: For the past two years at Pingar, I've been developing solutions for enterprise customers who accumulate unstructured data and want to search, analyze, and explore this data efficiently. We develop entity extraction, text summarization, and other text analytics solutions to help scrub and interpret unstructured data in an organization.

Anna Divoli: We're focusing on several verticals that struggle with too much textual data, such as bioscience, legal, and government. We also strive to develop language-independent solutions.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

What are the trends and challenges you're seeing in the big data space?

Anna Divoli: There are plenty of trends that span various aspects of big data, such as making data accessible from mobile devices, moving to cloud solutions, addressing security and privacy issues, and analyzing social data.

One trend that is pertinent to us is the increasing popularity of APIs. Plenty of APIs exist that give access to large datasets, but there are also powerful APIs that manage big data efficiently, such as text analytics, entity extraction, and data mining APIs.

Alyona Medelyan: The great thing about APIs is that they can be integrated into existing applications used inside an organization.

With regard to the challenges, enterprise data is very messy, inconsistent, and spread out across multiple internal systems and applications. APIs like the ones we're working on can bring consistency and structure to a company's legacy data.

The presentation you'll be giving at the Strata Conference will focus on practical applications of mining unstructured data. Why is this an important topic to address?

Anna Divoli: Every single organization in every vertical deals with unstructured data. Tons of text is produced daily — emails, reports, proposals, patents, literature, etc. This data needs to be mined to allow fast searching, easy processing, and quick decision making.

Alyona Medelyan: Big data often stands for structured data that is collected into a well-defined database — who bought which book in an online bookstore, for example. Such databases are relatively easy to mine because they have a consistent form. At the same time, there is plenty of unstructured data that is just as valuable, but it's extremely difficult to analyze it because it lacks structure. In our presentation, we will show how to detect structure using APIs, natural language processing and text mining, and demonstrate how this creates immediate value for business users.

Are there important new tools or projects on the horizon for big data?

Alyona Medelyan: Text analytics tools are very hot right now, and they improve daily as scientists come up with new ways of making algorithms understand written text more accurately. It is amazing that an algorithm can detect names of people, organizations, and locations within seconds simply by analyzing the context in which words are used. The trend for such tools is to move toward recognition of further useful entities, such as product names, brands, events, and skills.
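As an aside, here is a minimal sketch of the kind of named-entity extraction Medelyan describes. It uses the open source spaCy library rather than Pingar's API, and the model name and sample sentence are illustrative assumptions, not anything taken from the interview.

    # Illustrative named-entity extraction with the open source spaCy library.
    # This is not Pingar's API, just an example of the same class of technique.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English model, installed separately
    text = ("Alyona Medelyan and Anna Divoli of Pingar spoke at the Strata "
            "Conference in Santa Clara in February 2012.")

    doc = nlp(text)
    for ent in doc.ents:
        # ent.label_ is the entity type: PERSON, ORG, GPE (location), DATE, ...
        print(f"{ent.text:25s} {ent.label_}")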

Anna Divoli: Also, entity relation extraction is an important trend. A relation that consistently connects two entities in many documents is important information in science and enterprise alike. Entity relation extraction helps detect new knowledge in big data.

Other trends include detecting sentiment in social data, integrating multiple languages, and applying text analytics to audio and video transcripts. The number of videos grows at a constant rate, and transcripts are even more unstructured than written text because there is no punctuation. That's another exciting area on the horizon!

Who do you follow in the big data community?

Alyona Medelyan: We tend to follow researchers in areas that are used for dealing with big data, such as natural language processing, visualization, user experience, and human-computer information retrieval, as well as the semantic web. Two of them are also speaking at Strata this year: Daniel Tunkelang and Marti Hearst.


This interview was edited and condensed.



January 13 2012

Four short links: 13 January 2012

  1. How The Internet Gets Inside Us (The New Yorker) -- at any given moment, our most complicated machine will be taken as a model of human intelligence, and whatever media kids favor will be identified as the cause of our stupidity. When there were automatic looms, the mind was like an automatic loom; and, since young people in the loom period liked novels, it was the cheap novel that was degrading our minds. When there were telephone exchanges, the mind was like a telephone exchange, and, in the same period, since the nickelodeon reigned, moving pictures were making us dumb. When mainframe computers arrived and television was what kids liked, the mind was like a mainframe and television was the engine of our idiocy. Some machine is always showing us Mind; some entertainment derived from the machine is always showing us Non-Mind. (via Tom Armitage)
  2. SWFScan -- Windows-only Flash decompiler to find hardcoded credentials, keys, and URLs. (via Mauricio Freitas)
  3. Paranga -- haptic interface for flipping through an ebook. (via Ben Bashford)
  4. Facebook Gives Politico Deep Access to Users' Political Sentiments (All Things D) -- Facebook will analyse all public and private updates that mention candidates and an exclusive partner will "use" the results. Remember, if you're not paying for it then you're the product and not the customer.

January 12 2012

Four short links: 12 January 2012

  1. Smart Hacking for Privacy -- can mine smart power meter data (or even snoop it) to learn what's on the TV. Wow. (You can also watch the talk). (via Rob Inskeep)
  2. Conditioning Company Culture (Bryce Roberts) -- a short read but thought-provoking. It's easy to create mindless mantras, but I've seen the technique that Bryce describes and (when done well) it's highly effective.
  3. hydrat (Google Code) -- a declarative framework for text classification tasks.
  4. Dynamic Face Substitution (FlowingData) -- Kyle McDonald and Arturo Castro play around with a face tracker and color interpolation to replace their own faces, in real time, with those of celebrities such as Brad Pitt and Paris Hilton. Awesome. And creepy. Amen.

December 26 2011

Four short links: 26 December 2011

  1. Pattern -- a BSD-licensed bundle of Python tools for data retrieval, text analysis, and data visualization. If you were going to get started with accessible data (Twitter, Google), the fundamentals of analysis (entity extraction, clustering), and some basic visualizations of graph relationships, you could do a lot worse than to start here.
  2. Factorie (Google Code) -- Apache-licensed Scala library for a probabilistic modeling technique successfully applied to [...] named entity recognition, entity resolution, relation extraction, parsing, schema matching, ontology alignment, latent-variable generative models, including latent Dirichlet allocation. The state-of-the-art big data analysis tools are increasingly open source, presumably because the value lies in their application not in their existence. This is good news for everyone with a new application.
  3. Playtomic -- analytics as a service for gaming companies to learn what players actually do in their games. There aren't many fields untouched by analytics.
  4. Write or Die -- iPad app for writers where, if you don't keep writing, it begins to delete what you wrote earlier. Good for production to deadlines; reflective editing and deep thought not included.

November 01 2011

Demoting Halder: A wild look at social tracking and sentiment analysis

I've been holding conversations with friends around a short story I put up on the Web last week, Demoting Halder, and interesting reactions have come up. Originally, the story was supposed to lay out an alternative reality where social tracking and sentiment analysis had taken over society so pervasively that everything people did revolved around them. As the story evolved, I started to wonder whether the reality in the story was an alternative one or something we are living right now. I think this is why people have been responding to the story.

The old saying, "First impressions are important" is going out of date. True, someone may form a lasting opinion of you based on the first information he or she hears, but you no longer have control over what this first information is. Businesses go to great lengths to influence what tops the results in Google and other search engines. There are court battles over ads that are delivered when people search for product names--it's still unclear whether a company can be successfully sued for buying an ad for a name trademarked by a competitor. But after all this effort, someone may hear about you first on some forum you don't even know about.

In short, by the time people call you or send email, you have no idea what they know already and what they think about you. One friend told me, "Social networking turns the whole world into one big high school (and I didn't like high school)." Nearly two years ago, I covered questions of identity online, with a look at the effects of social networking, in a series on Radar. I think it's still relevant, particularly concerning the choices it raised about how to behave on social networks, what to share, and--perhaps most importantly--how much to trust what you see about other people.

Some people assiduously monitor what comes up when they Google their name or how many followers they have on various social networks. Businesses are springing up that promise even more sophisticated ways to rank people or organizations. Some of the background checking shades over into outright stalking, where an enemy digs up obscure facts that seem damaging and posts them to a forum where they can influence people's opinion of the victim. One person who volunteered for a town commission got on the wrong side of somebody who came before the commission, and had to cope with such retaliation as having pictures of her house posted online along with nasty comments. I won't mention what she found out when she turned the tables and looked the attacker up online. After hearing her real-life experiences, I felt like my invented story would soon be treated as a documentary.

And the success characters have gaming the system in Demoting Halder should be readily believable. Today we depend heavily on ratings even though there are scads of scams on auction sites, people using link farms and sophisticated spam-like schemes to boost search results, and skewed ratings on travel sites and similar commercial ventures.

One friend reports, "It is amazing how many people have checked me and my company out before getting on the initial call." Tellingly, she goes on to admit, "Of course, I do the same. It used to be that was rare behavior. Now it is expected that you will have this strange conversation where both parties know way too much about each other." I'm interested in hearing more reactions to the story.


If your data practices were made public, would you be nervous?

The practice of data mining often elicits a knee-jerk reaction from consumers, with some viewing it as a violation of their privacy. In a recent interview, Solon Barocas (@s010n), a doctoral student at New York University, discussed the perceptions of data mining and how companies can address data mining's reputation.

Highlights from the interview (below) included:

  • What do consumers think data mining entails? "Data mining almost intuitively for most consumers implies scavenging through the data, trying to find secrets that you don't necessarily want people to know," Barocas said. "It's really difficult to explain what data mining actually is. I think of it, in a sense, to be a particular form of machine learning. And these are complicated things — very, very complicated. A challenge for people in the industry, regulators, and anyone else interested in these issues, is to figure out a way to communicate these technical things to a lay audience." [Discussed at the 0:41 mark.]
  • Do we need a different phrase in lieu of "data-mining"? Barocas argued: "[We should] try to push back against the misuses of the term, re-appropriate the term data mining, and explain it's not 'data-dredging.' It's not this case of running through everyone's data. We need to instead explain data mining is a kind of analysis that lets us discover interesting and important new trends. I think there's an enormous amount of value in data mining and being able to explain precisely what that value is without making it seem like it's just snooping." [Discussed at 1:12.]
  • What "ethical red flags" should companies and data scientists be aware of? "There are potential problems all along the line," said Barocas, as after all, it can be difficult for companies performing analysis to know what to collect and what not to collect. "The rule of thumb: If your practice was made public — widely public — would you be nervous?" Barocas said he realizes that's "not a very sophisticated rule," but it's one that might guide responsibility in the data mining space. [Discussed at 2:50.]

The full interview is available in the video below:




Some quotes from this interview were edited and condensed for clarity.


July 30 2011

Report from Open Source convention health track, 2011

Open source software in health care? It's limited to a few pockets of use--at least in the United States--but if you look at it a bit, you start to wonder why any health care institution uses any proprietary software at all.

What the evidence suggests

Take the conference session by University of Chicago researchers commissioned to produce a report for Congress on open source in health care. They found several open source packages that met the needs for electronic records at rural providers with few resources, such as safety-net providers.

They found that providers who adopted open source started to make the changes that the adoption of electronic health records (or any major new system) is supposed to bring about, but rarely does in proprietary health settings.

  • They offer the kinds of extra attention to patients that improve their health, such as asking them questions about long-term health issues.

  • They coordinate care better between departments.

  • They have improved their workflows, saving a lot of money.

And incidentally, deployment of an open source EHR took an estimated 40% of the cost of deploying a proprietary one.

Not many clinics of the type examined--those in rural, low-income areas--have the time and money to install electronic records, and far fewer use open source ones. But the half-dozen examined by the Chicago team were clear success stories. They covered a variety of areas and populations, and three used WorldVistA while three used other EHRs.

Their recommendations are:

  • Greater coordination between open source EHR developers and communities, to explain what open source is and how it benefits providers.

  • Forming a Community of Practice on health centers using open source EHRs.

  • Greater involvement from the Federal Government, not to sponsor open source, but to make communities aware that it's an option.

Why do so few providers adopt open source EHRs? The team attributed the problem partly to prejudice against open source. But I picked up another, deeper concern from their talk. They said success in implementing open source EHRs depends on a "strong, visionary leadership team." As much as we admire health providers, teams like that are hard to form and consequently hard to find. But of course, any significant improvement in work processes would require such a team. What the study demonstrated is that it happens more in the environment of an open source product.

There are some caveats to keep in mind when considering these findings--some limitations to the study. First, the researchers had very little data about the costs of implementing proprietary health care systems, because the vendors won't allow customers to discuss it, and just two studies have been published. Second, the sample of open source projects was small, although the consistency of positive results was impressive. And the researchers started out sympathetic to open source. Despite the endorsement of open source represented by their findings, they recognized that it's harder to find open source and that all the beneficial customizations take time and money. During a Birds-of-a-Feather session later in the conference, many of us agreed that proprietary solutions are here for quite some time, and can benefit by incorporating open source components.

The study nevertheless remains important and deserves to be released to Congress and the public by the Department of Health and Human Services. There's no point to keeping it under wraps; the researchers are proceeding with phase 2 of the study with independent funding and are sure to release it.

So who uses open source?

It's nice to hear about open source projects (and we had presentations on several at last year's OSCon health care track) but the question on the ground is what it's like to actually put one in place. The implementation story we heard this year was from a team involving Roberts-Hoffman Software and Tolven.

Roberts-Hoffman is an OSCon success story. Last year they received a contract from a small health care provider to complete a huge EHR project in a crazily short amount of time, including such big-ticket requirements as meeting HIPAA requirements. Roberts-Hoffman knew little about open source, but surmised that the customization it permitted would let them achieve their goal. Roberts-Hoffman CEO Vickie Hoffman therefore attended OSCon 2010, where she met a number of participants in the health care track (including me) and settled on Tolven as their provider.

The customer put some bumps in the road to the open source approach. For instance, they asked with some anxiety whether an open source product would expose their data. Hoffman had a little educating to do.

Another hurdle was finding a vendor to take medication orders. Luckily, Lexicomp was willing to work with a small provider and showed a desire to have an open source solution for providers. Roberts-Hoffman ended up developing a Tolven module using Lexicomp's API and contributing it back to Tolven. This proprietary/open source merger was generally quite successful, although it was extra work providing tests that someone could run without a Lexicomp license.

In addition to meeting what originally seemed an impossible schedule, Tolven allowed an unusual degree of customization through templating, and ensured the system would work with standard medical vocabularies.

Why can't you deliver my data?

After presentations on health information exchanges at OSCON, I started to ruminate about data delivery. My wife and I had some problems with appliances this past Spring and indulged in some purchases of common household items, a gas grill from one company and a washing machine from another. Each offered free delivery. So if low-margin department stores can deliver 100-pound appliances, why can't my doctor deliver my data to a specialist I'm referred to?

The CONNECT Gateway and the Direct Project aim to solve that problem. CONNECT is the older solution, with Direct offering an easier-to-implement system that small health care providers will appreciate. Both have the goal of allowing health care providers to exchange patient data with each other, and with other necessary organizations such as public health agencies, in a secure manner.

David Riley, who directed the conversion of CONNECT to an open-source, community-driven project at the Office of the National Coordinator in the Department of Health and Human Services, kicked off OSCon's health care track by describing the latest developments. He had led off last year's health care track with a perspective on CONNECT delivered from his role in government, and he moved smoothly this time into covering the events of the past year as a private developer.

The open-source and community aspects certainly proved their value when a controversy and lawsuit over government contracts threatened to stop development on CONNECT. Although that's all been resolved now, Riley decided in the Spring to leave government and set up an independent non-profit foundation, Alembic, to guide CONNECT. The original developers moved over to Alembic, notably Brian Behlendorf, and a number of new companies and contributors came along. Most of the vendors who had started out on the ONC project stayed with the ONC, and were advised by Riley to do so until Alembic's course was firm.

Lots of foundations handle open source projects (Apache, etc.) but Riley and Behlendorf decided none of them were proper for a government-centric health care project. CONNECT demanded a unique blend of sensitivity to the health care field and experience dealing with government agencies, who have special contract rules and have trouble dealing with communities. For instance, government agencies are tasked by Congress with developing particular solutions in a particular time frame, and cannot cite as an excuse that some developer had to take time off to get a full-time job elsewhere.

Riley knows how to handle the myriad pressures of these projects, and has brought that expertise to Alembic. CONNECT software has been released and further developed under a BSD license as the Aurion project. Now that the ONC is back on track and is making changes of its own, the two projects are trying to heal the fork and are following each other's changes closely. Because Aurion has to handle sensitive personal data deftly, Riley hopes to generalize some of the software and create other projects for handling personal data.

Two Microsoft staff came to OSCon to describe Direct and the open-source .NET libraries implementing it. It turned out that many in the audience were uninformed about Direct (despite an intense outreach effort by the ONC) and showed a good deal of confusion about it. So speakers Vaibhav Bhandari and Ali Emami spent the whole time allotted (and more) explaining Direct, with time for just a couple slides pointing out what the .NET libraries can do.

Part of the problem is that security is broken down into several different functions in ONC's solution. Direct does not help you decide whether to trust the person you're sending data to (you need to establish a trust relationship through a third party that grants certificates) or find out where to send it (you need to know the correspondent's email address or another connection point). But two providers or other health care entities who make an agreement to share data can use Direct to do so over email or other upcoming interfaces.

There was a lot of cynicism among attendees and speakers about whether government efforts, even with excellent protocols and libraries, can get doctors to offer patients and other doctors the necessary access to data. I think the reason I can get a big-box store to deliver an appliance but I can't get my doctor to deliver data is that the big-box store is part of a market, and therefore wants to please the customer. Despite all our talk of free markets in this country, health care is not a market. Instead, it's a grossly subsidized system where no one has choice. And it's not just the patients who suffer. Control is removed from the providers and payers as well.

The problem will be solved when patients start acting like customers and making appropriate demands. If you could say, "I'm not filling out those patient history forms one more time--you just get the information where I'm going," it might have an effect. More practically speaking, let's provide simple tools that let patients store their history on USB keys or some similar medium, so we can walk into a doctor's office and say "Here, load this up and you'll have everything you need."

What about you, now?

Patient control goes beyond data. It's really core to solving our crisis in health care and costs. A lot of sessions at OSCon covered things patients could do to take control of their health and their data, but most of them were assigned to the citizen health track (I mentioned them at the end of my preview article a week ago) and I couldn't attend them because they were concurrent with the health care track.

Eri Gentry delivered an inspiring keynote about her work in the biology start-up BioCurious, Karen Sandler (who had spoken in last year's health care track) scared us all with the importance of putting open source software in medical devices, and Fred Trotter gave a brief but riveting summary of the problems in health care. Fred also led a session on the Quantified Self, which was largely a discussion with the audience about ways we could encourage better behavior in ourselves and the public at large.

Guaranteed to cause meaningful change

I've already touched on the importance of changing how most health care institutions treat patients, and how open source can help. David Uhlman (who has written a book for O'Reilly with Fred Trotter) covered the complex topic of meaningful use, a phrase that appeared in the Recovery Act of 2009 and that drives just about all the change in current U.S. institutions. The term "meaningful use" implies that providers do more than install electronic systems; they use them in ways that benefit the patients, the institutions themselves, and the government agencies that depend on their data and treatments.

But Uhlman pointed out that doctors and health administrators--let alone the vendors of EHRs--focus on the incentive money and seem eager to do the minimum that gets them a payout. This is self-defeating, because the government will raise the requirements for meaningful use over the years, overwhelming quick-and-dirty implementations that fail to solve real problems. Of course, the health providers keep pushing back the more stringent requirements to later years, but they'll have to face the music someday. Perhaps the delay will be good for everyone in the long run, because it will give open source products a chance to demonstrate their value and make inroads where they are desperately needed.

As a crude incentive to install electronic records, meaningful use has been a big success. Before the Recovery Act was passed, 15%-20% of U.S. providers had EHRs. Now the figure is 60% or 70%, and by the end of 2012 it will probably be 90%. But it remains to be seen whether doctors use these systems to make better clinical decisions, follow up with patients so they comply with treatments, and eliminate waste.

Uhlman said that technology accounts for about 20% of the solution. The rest is workflow. For instance, every provider should talk to patients on every visit about central health concerns, such as hypertension and smoking. Research has suggested that this will add 30% more time per visit. If it reduces illness and hospital admissions, of course, we'll all end up paying less in taxes and insurance. His slogan: meaningful use is a payout for quality data.

It may be surprising--especially to an OSCon audience--that one of the biggest hurdles to achieving meaningful use is basic computer skills. We're talking here about typing information in correctly, knowing that you need to scroll down to look at all the information on the screen, and the like. All the institutions Uhlman visits think they're in fine shape and everybody has the basic skills, but every examination he's done proves that 20%-30% of the staff are novices in computer use. And of course, facilities are loath to spend extra money to develop these skills.

Open source everywhere

Open source has image and marketing problems in the health care field, but solutions are emerging all over the place. Three open source systems right now are certified for meaningful use: ClearHealth (Uhlman's own product), CareVue from MedSphere, and WorldVistA. OpenEMR is likely to join them soon, having completed the testing phase. vxVistA is certified but may depend on some proprietary pieces (the status was unclear during the discussion).

Two other intriguing projects presented at OSCon this year were popHealth and Indivo X. I interviewed architects from Indivo X and popHealth before they came to speak at OSCon. I'll just say here that popHealth has two valuable functions. It helps providers improve quality by providing a simple web interface that makes it easy for them to view and compare their quality measures (for instance, whether they offered appropriate treatment for overweight patients). Additionally, popHealth saves a huge amount of tedious manual effort by letting them automatically generate reports about these measures for government agencies. Indivo fills the highly valued space of personal health records. It is highly modular, permitting new data sources and apps to be added; in fact, speaker Daniel Haas wants it to be an "app store" for medical applications. Both projects use modern languages, frameworks, and databases, facilitating adoption and use.

Other health care track sessions

An excellent and stimulating track was rounded out with several other talks.

Shahid Shah delivered a talk on connecting medical devices to electronic record systems. He adroitly showed how the data collected from these devices is the most timely and accurate data we can get (better than direct reports from patients or doctors, and faster than labs), but we currently let it slip away from us. He also went over standard pieces of the open source stacks that facilitate the connection of devices, talked a bit about regulations, and discussed the role of routine engineering practices such as risk assessments and simulations.

Continuing on the quality theme, David Richards mentioned some lessons he learned designing a clinical decision support system. It's a demanding discipline. Accuracy is critical, but results must be available quickly so the doctor can use them to make decisions during the patient visit. Furthermore, the suggestions returned must be clear and precise.

Charlie Quinn talked about the collection of genetic information to achieve earlier diagnoses of serious conditions. I could not attend his talk because I was needed at another last-minute meeting, but I sat down for a while with him later.

The motto at his Benaroya Research Institute is to have diagnosis be more science, less art. With three drops of blood, they can do a range of tests on patients suspected of having particular health conditions. Genomic information in the blood can tell a lot about health, because blood contains viruses and other genomic material besides the patient's own genes.

Tests can compare the patients to each other and to a healthy population, narrowing down comparisons by age, race, and other demographics. As an example, the institute took samples before a vaccine was administered, and then at several frequent intervals in the month afterward. They could tell when the vaccine had the most powerful effect on the body.

The open source connection here is the institute's desire to share data among multiple institutions so that more patients can be compared and more correlations can be made. Quinn said it's hard to get institutions to open up their data.

All in all, I was energized by the health care track this year, and really impressed with the knowledge and commitment of the people I met. Audience questions were well-informed and contributed a lot to the presentations. OSCon shows that open source health care, although it hasn't broken into the mainstream yet, already inspires a passionate and highly competent community.

July 14 2011

Four short links: 14 July 2011

  1. Digging into Technology's Past -- stories of the amazing work behind the visual 6502 project and how they reconstructed and simulated the legendary 6502 chip. To analyze and then preserve the 6502, James treated it like the site of an excavation. First, he needed to expose the actual chip by removing its packaging of essentially “billiard-ball plastic.” He eroded the casing by squirting it with very hot, concentrated sulfuric acid. After cleaning the chip with an ultrasonic cleaner—much like what’s used for dentures or contact lenses—he could see its top layer.
  2. Leaflet -- BSD-licensed lightweight Javascript library for interactive maps, using the Open Street Map.
  3. Too Many Public Works Built on Rosy Scenarios (Bloomberg) -- a feedback loop with real data being built to improve accuracy estimating infrastructure project costs. He would like to see better incentives -- punishment for errors, rewards for accuracy -- combined with a requirement that forecasts not only consider the expected characteristics of the specific project but, once that calculation is made, adjust the estimate based on an “outside view,” reflecting the cost overruns of similar projects. That way, the “unexpected” problems that happen over and over again would be taken into consideration. Such scrutiny would, of course, make some projects look much less appealing -- which is exactly what has happened in the U.K., where “reference-class forecasting” is now required. “The government stopped a number of projects dead in their tracks when they saw the forecasts,” Flyvbjerg says. “This had never happened before.”
  4. Neurovigil Gets Cash Injection To Read Your Mind (FastCompany) -- "an anonymous American industrialist and technology visionary" put tens of millions into this company, which has hardware to gather mineable data. iBrain promises to open a huge pipeline of data with its powerful but simple brain-reading tech, which is gaining traction thanks to technological advances. But the other half of the potentially lucrative equation is the ability to analyze the trove of data coming from iBrain. And that's where NeuroVigil's SPEARS algorithm enters the picture. Not only is the company simplifying collection of brain data with a device that can be relatively comfortably worn during all sorts of tasks--sleeping, driving, watching advertising--but the combination of iBrain and SPEARS multiplies the efficiency of data analysis. (via Vaughan Bell)

June 03 2011

Four short links: 3 June 2011

  1. Silk Road (Gawker) -- Tor-delivered "web" site that is like an eBay for drugs, currency is Bitcoins. Jeff Garzik, a member of the Bitcoin core development team, says in an email that bitcoin is not as anonymous as the denizens of Silk Road would like to believe. He explains that because all Bitcoin transactions are recorded in a public log, though the identities of all the parties are anonymous, law enforcement could use sophisticated network analysis techniques to parse the transaction flow and track down individual Bitcoin users. "Attempting major illicit transactions with bitcoin, given existing statistical analysis techniques deployed in the field by law enforcement, is pretty damned dumb," he says. The site is viewable here, and here's a discussion of delivering hidden web sites with Tor. (via Nelson Minar)
  2. Dr Waller -- a big game using DC Comics characters where players end up crowdsourcing science on GalaxyZoo. A nice variant on the captcha/ESP-style game that Luis von Ahn is known for. (via BoingBoing)
  3. Machine Learning Demos -- hypnotically beautiful. Code for download.
  4. Esper -- stream event processing engine, GPLv2-licensed Java. (via Stream Event Processing with Esper and Edd Dumbill)


May 24 2011

Four short links: 24 May 2011

  1. Delivereads -- genius idea, a mailing list for Kindles. Yes, if you can send email then you can be a Kindle publisher. (via Sacha Judd)
  2. Abnormal Returns From the Common Stock Investments of Members of the U.S. House of Representatives -- We measure abnormal returns for more than 16,000 common stock transactions made by approximately 300 House delegates from 1985 to 2001. Consistent with the study of Senatorial trading activity, we find stocks purchased by Representatives also earn significant positive abnormal returns (albeit considerably smaller returns). A portfolio that mimics the purchases of House Members beats the market by 55 basis points per month (approximately 6% annually). (via Ellen Miller)
  3. Google News Archive Ends -- hypothesizes that old material was "too hard" to make sense of, but that seems unlikely to me. More likely is that it wasn't useful enough to their machine learning efforts. Newspapers can have their scanned/OCRed content for free now the program is being closed.
  4. Week Report 310 -- BERG's first (that I've seen) video report of the week, and it's a cracker. No newsreel, just some really clever evocation of the mood of the place and the nature of the projects. I continue to be impressed by the BERG crew's conscious creation of culture.

April 28 2011

Strata Week: Overcharging algorithms

Here are a few of the data stories that caught my eye this week.

When algorithms overcharge on Amazon

A postdoc in Michael Eisen's lab at UC Berkeley logged in to Amazon a couple of weeks ago to purchase a copy of Peter Lawrence's "The Making of a Fly." Although out of print, the book is a classic in the field of evolutionary biology, and there were several copies available, both new and used. The used copies were on sale for roughly $35. The two new copies were priced a bit higher: $1.7 million and $2.1 million. Although he assumed at first it was a mistake, when Eisen returned to the page the next day, he found the price had gone up, with both books for sale at around $2.8 million. By the end of the day, the price of one was raised again, to more than $3.5 million.

Some folks got creative in response to the multi-million-dollar price tag attached to "The Making of a Fly."

Eisen worked out that once a day, one of the sellers was setting its price to 0.9983 times the price of the copy offered by the other, while the other seller was in turn setting its price to 1.270589 times the first's. Both were using algorithmic pricing, a common practice among Amazon vendors and Amazon itself, to automatically adjust prices in response to a competitor's.

It's obvious why one vendor would establish an algorithm to perpetually undercut the competition. Less clear is why the other would choose to always price higher. It's possible that the vendor was hoping that high ratings would compel customers to pay the higher price. But Eisen thinks it's more likely that the vendor didn't actually own a copy of the book, and set the algorithm to aim for a higher price so as to cover acquisition costs.
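To see why this combination runs away, here is a small simulation. The two factors come from Eisen's analysis; the $35 starting prices and the strict once-a-day cadence are assumptions made purely for illustration, not a reconstruction of the sellers' actual systems.

    # A minimal sketch of two multiplicative repricing rules whose factors
    # multiply to more than 1, producing a runaway price spiral.
    UNDERCUT = 0.9983    # seller A prices just below seller B
    MARKUP   = 1.270589  # seller B prices well above seller A

    price_a, price_b = 35.00, 35.00  # hypothetical starting prices in dollars
    for day in range(30):
        price_a = UNDERCUT * price_b   # A reprices against B once a day
        price_b = MARKUP * price_a     # B then reprices against A
        print(f"day {day + 1:2d}: A=${price_a:,.2f}  B=${price_b:,.2f}")

    # Each full cycle multiplies both prices by UNDERCUT * MARKUP (about 1.268),
    # so prices grow exponentially until a human notices and resets them.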

Eisen wrote:

What's fascinating about all this is both the seemingly endless possibilities for both chaos and mischief. It seems impossible that we stumbled onto the only example of this kind of upward pricing spiral — all it took were two sellers adjusting their prices in response to each other by factors whose products were greater than 1. And while it might have been more difficult to deconstruct, one can easily see how even more bizarre things could happen when more than two sellers are in the game. And as soon as it was clear what was going on here, I and the people I talked to about this couldn't help but start thinking about ways to exploit our ability to predict how others would price their books down to the 5th significant digit — especially when they were clearly not paying careful attention to what their algorithms were doing.

Eventually someone noticed, and the price dropped to around $150.

White hot Hadoop

Yahoo is considering spinning off its Hadoop engineering unit into a new company, according to a story this week in The Wall Street Journal. Yahoo didn't comment for that story, but the piece cites Benchmark Capital partner Rob Bearden as saying that the venture capital firm has spoken to Yahoo about how it might form a separate Hadoop-oriented company.

The article posits that the Hadoop market is a multi-billion dollar one and that the opportunity is huge for Yahoo, something that GigaOm's Derrick Harris examines with a more nuanced eye to the market. "For Hadoop users and startups building tools atop Hadoop, though," Harris concludes, "more competition among distributions is only good news."

U.S. Supreme Court weighs legality of data mining

The U.S. Supreme Court heard oral arguments this week in "Sorrell v. IMS Health," a case that will determine the constitutionality of a Vermont law restricting the commercial distribution of a physician's prescription records. The outcome could set important precedents in privacy and data issues.

In 2007, Vermont's legislature passed the Prescription Confidentiality Law, giving doctors the ability to deny pharmacies the option of selling their prescription information to data-mining companies. IMS Health, along with two other data-collection companies and PhRMA, a pharmaceutical industry association, challenged the constitutionality of the law, arguing it would make it more difficult for drugmakers to identify doctors for potential sales.

SCOTUSblog's Lyle Denniston reports that the justices grilled the Vermont Attorney General about the law, questioning whether it was written too narrowly — targeting only the pharmacies and not insurance companies, for example — or whether it served to protect doctors' privacy.

Denniston wrote:

... it became very clear that the Justices — perhaps more than a simple majority — see this first test case as one about corporate free speech. That might not turn out to be true in every case of data-mining that comes along, but it would certainly seem so when a legislature blatantly sets out to curb the use of that technology to convey a commercial message, made up of truthful information.

The Supreme Court is expected to announce its decision this summer.

Got data news?

Feel free to email me.





January 26 2011

Four short links: 26 January 2011

  1. Find Communities -- algorithm for uncovering communities in networks of millions of nodes, for producing identifiable subgroups as in LinkedIn InMaps. (via Matt Biddulph's Delicious links)
  2. Seven Ways to Think Like The Web (Jon Udell) -- seven principles that will head off a lot of mistakes. They should be seared into the minds of anyone working in the web. 2. Pass by reference rather than by value. [pass URLs, not copies of data] [...] Why? Nobody else cares about your data as much as you do. If other people and other systems source your data from a canonical URL that you advertise and control, then they will always get data that’s as timely and accurate as you care to make it.
  3. Wire It -- an open-source javascript library to create web wirable interfaces for dataflow applications, visual programming languages, graphical modeling, or graph editors. (via Pete Warden)
  4. Interview with Marco Arment (Rands in Repose) -- Most people assume that online readers primarily view a small number of big-name sites. Nearly everyone who guesses at Instapaper’s top-saved-domain list and its proportions is wrong. The most-saved site is usually The New York Times, The Guardian, or another major traditional newspaper. But it’s only about 2% of all saved articles. The top 10 saved domains are only about 11% of saved articles. (via Courtney Johnston's Instapaper Feed)

December 19 2010

Strata gems: What your inbox knows

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: A sense of self.

One of our themes at Strata is data in the dirt: mining the data exhaust from our lives to find meaning and value. In every organization, the trails left by email offer one of those repositories of hidden meaning.

Trampoline Systems' SONAR CRM takes an innovative approach to customer relationship management by mining the social networks created within and between companies. Through its integration with email logs, existing CRM systems, and social networks, SONAR expands the scope of traditional CRM to give a fuller view of a company's relationships.

There is often more truth to be found in mining implicit data trails than by relying on explicitly logged information. Trampoline estimate that only 25% of actual contacts are recorded in CRM systems. By analyzing email flows, their system lets organizations understand who is talking to whom.
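As a rough illustration of the kind of signal such systems mine (this is not SONAR's implementation), here is a minimal sketch that counts who emails whom in a local mbox archive; the file name is hypothetical.

    # Illustrative only: count who emails whom from a local mbox archive.
    # This is not SONAR's code, just the kind of implicit data trail such tools mine.
    import mailbox
    from collections import Counter
    from email.utils import getaddresses

    edges = Counter()
    for msg in mailbox.mbox("archive.mbox"):   # hypothetical archive file
        senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
        recipients = [addr for _, addr in
                      getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))]
        for s in senders:
            for r in recipients:
                if s and r and s != r:
                    edges[(s.lower(), r.lower())] += 1

    # The heaviest edges approximate the organization's real communication graph.
    for (sender, recipient), count in edges.most_common(10):
        print(f"{sender} -> {recipient}: {count} messages")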

At O'Reilly, we specialize in connecting people across and within technical community "tribes". We've been experimenting with SONAR for some months. In my experience, it certainly contains the same knowledge about our contacts that I would otherwise have to obtain by asking around.

A SONAR visualization of some of O'Reilly's internal relationships

The more information you feed a system such as SONAR, the better results you can get. For instance, not all prodigious communicators are at the same level of influence: customer service personnel talk to as many people as business development, for instance, but the relationships they develop are of a more fleeting nature.

  • For a personal view on email analytics, Xobni offers an Outlook plugin that augments your email with information from social networks and adds analytical capabilities.

December 17 2010

Four short links: 17 December 2010

  1. Down the ls(1) Rabbit Hole -- exactly how ls(1) does what it does, from logic to system calls to kernel. This is the kind of deep understanding of systems that lets great programmers cut great code. (via Hacker News)
  2. Towards a scientific concept of free will as a biological trait: spontaneous actions and decision-making in invertebrates (Royal Society) -- peer-reviewed published paper that was initially reviewed and improved in Google Docs and got comments there, in FriendFeed, and on his blog. The bitter irony: Royal Society charged him €2000 to make it available for free download. (via Fabiana Kubke)
  3. Bixo -- an open source web mining toolkit. (via Matt Biddulph on Delicious)
  4. How Facebook Does Design -- podcast (with transcript) with stories about how tweaking design improved the user activity on Facebook. One of the designers thought closing your account should be more like leaving summer camp (you know a place which has all your friends, and you don’t want to leave.) So he created this page above for deactivation which has all your friends waving good-bye to you as you deactivate. Give you that final tug of the heart before you leave. This reduced the deactivation rate by 7%.

December 03 2010

Four short links: 3 December 2010

  1. Data is Snake Oil (Pete Warden) -- data is powerful but fickle. A lot of theoretically promising approaches don't work because there's so many barriers between spotting a possible relationship and turning it into something useful and actionable. This is the pin of reality which deflates the bubble of inflated expectations. Apologies for the camel's nose of rhetoric poking under the metaphoric tent.
  2. XML vs the Web (James Clark) -- resignation and understanding from one of the markup legends. I think the Web community has spoken, and it's clear that what it wants is HTML5, JavaScript and JSON. XML isn't going away but I see it being less and less a Web technology; it won't be something that you send over the wire on the public Web, but just one of many technologies that are used on the server to manage and generate what you do send over the wire. (via Simon Willison)
  3. Understanding Pac Man Ghost Behaviour -- The ghosts’ AI is very simple and short-sighted, which makes the complex behavior of the ghosts even more impressive. Ghosts only ever plan one step into the future as they move about the maze. Whenever a ghost enters a new tile, it looks ahead to the next tile that it will reach, and makes a decision about which direction it will turn when it gets there. Really detailed analysis of just one component of this very successful game. (via Hacker News)
  4. The Full Stack (Facebook) -- we like to think that programming is easy. Programming is easy, but it is difficult to solve problems elegantly with programming. I like to think that a CS education teaches you this kind of "full stack" approach to looking at systems, but I suspect it's a side-effect and not a deliberate output. This is the core skill of great devops: to know what's happening up and down the stack so you're not solving a problem at level 5 that causes problems at level 3.
