Newer posts are loading.
You are at the newest post.
Click here to check if anything new just came in.

February 07 2012

Unstructured data is worth the effort when you've got the right tools


It's dawning on companies that data analysis can yield insights and inform business decisions. As data-driven benefits grow, so do our demands about what more data can tell us and what other types we can mine.

During her PhD studies, Alyona Medelyan (@zelandiya) developed Maui, an open source tool that performs as well as professional librarians in identifying main topics in documents. Medelyan now leads the research and development of API-based products at Pingar.

Pingar senior software researcher Anna Divoli (@annadivoli) studied sentence extraction for semi-automatic annotation of biological databases. Her current research focuses on developing methodologies for acquiring knowledge from textual data.

"Big data is important in many diverse areas, such as science, social media, and enterprise," observes Divoli. "Our big data niche is analysis of unstructured text." In the interview below, Medelyan and Divoli describe their work and what they see on the horizon for unstructured data analysis.

How did you get started in big data?

Anna Divoli: I began working with big data as it relates to science during my PhD. I worked with bioinformaticians who mined proteomics data. My research was on mining information from the biomedical literature that could serve as annotation in a database of protein families.

Alyona Medelyan: Like Anna, I mainly focus on unstructured data and how it can be managed using clever algorithms. During my PhD in natural language processing and data mining, I started applying such algorithms to large datasets to investigate how time-consuming data analysis and processing tasks can be automated.

What projects are you working on now?

Alyona Medelyan: For the past two years at Pingar, I've been developing solutions for enterprise customers who accumulate unstructured data and want to search, analyze, and explore this data efficiently. We develop entity extraction, text summarization, and other text analytics solutions to help scrub and interpret unstructured data in an organization.

Anna Divoli: We're focusing on several verticals that struggle with too much textual data, such as bioscience, legal, and government. We also strive to develop language-independent solutions.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

What are the trends and challenges you're seeing in the big data space?

Anna Divoli: There are plenty of trends that span various aspects of big data, such as making the data accessible from mobile devices, cloud solutions, addressing security and privacy issues, and analyzing social data.

One trend that is pertinent to us is the increasing popularity of APIs. Plenty of APIs exist that give access to large datasets, but there also powerful APIs that manage big data efficiently, such as text analytics, entity extraction, and data mining APIs.

Alyona Medelyan: The great thing about APIs is that they can be integrated into existing applications used inside an organization.

With regard to the challenges, enterprise data is very messy, inconsistent, and spread out across multiple internal systems and applications. APIs like the ones we're working on can bring consistency and structure to a company's legacy data.

The presentation you'll be giving at the Strata Conference will focus on practical applications of mining unstructured data. Why is this an important topic to address?

Anna Divoli: Every single organization in every vertical deals with unstructured data. Tons of text is produced daily — emails, reports, proposals, patents, literature, etc. This data needs to be mined to allow fast searching, easy processing, and quick decision making.

Alyona Medelyan: Big data often stands for structured data that is collected into a well-defined database — who bought which book in an online bookstore, for example. Such databases are relatively easy to mine because they have a consistent form. At the same time, there is plenty of unstructured data that is just as valuable, but it's extremely difficult to analyze it because it lacks structure. In our presentation, we will show how to detect structure using APIs, natural language processing and text mining, and demonstrate how this creates immediate value for business users.

Are there important new tools or projects on the horizon for big data?

Alyona Medelyan: Text analytics tools are very hot right now, and they improve daily as scientists come up with new ways of making algorithms understand written text more accurately. It is amazing that an algorithm can detect names of people, organizations, and locations within seconds simply by analyzing the context in which words are used. The trend for such tools is to move toward recognition of further useful entities, such as product names, brands, events, and skills.

Anna Divoli: Also, entity relation extraction is an important trend. A relation that consistently connects two entities in many documents is important information in science and enterprise alike. Entity relation extraction helps detect new knowledge in big data.

Other trends include detecting sentiment in social data, integrating multiple languages, and applying text analytics to audio and video transcripts. The number of videos grows at a constant rate, and transcripts are even more unstructured than written text because there is no punctuation. That's another exciting area on the horizon!

Who do you follow in the big data community?

Alyona Medelyan: We tend to follow researchers in areas that are used for dealing with big data, such as natural language processing, visualization, user experience, human computer information retrieval, as well as the semantic web. Two of them are also speaking at Strata this year: Daniel Tunkelang and Marti Hearst.


This interview was edited and condensed.

Related:

Reposted bycheg00 cheg00

February 03 2012

Top stories: January 30-February 3, 2012

Here's a look at the top stories published across O'Reilly sites this week.

What is Apache Hadoop?
Apache Hadoop has been the driving force behind the growth of the big data industry. But what does it do, and why do you need all its strangely-named friends? (Related: Hadoop creator Doug Cutting on why Hadoop caught on.)

Embracing the chaos of data
Data scientists, it's time to welcome errors and uncertainty into your data projects. In this interview, Jetpac CTO Pete Warden discusses the advantages of unstructured data.

Moneyball for software engineering, part 2
A look at the "Moneyball"-style metrics and techniques managers can employ to get the most out of their software teams.

With GOV.UK, British government redefines the online government platform
A new beta .gov website in Britain is open source, mobile friendly, platform agnostic, and open for feedback.


When will Apple mainstream mobile payments?
David Sims parses the latest iPhone / near-field-communication rumors and considers the impact of Apple's (theoretical) entrance into the mobile payment space.



Strata 2012, Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work. Save 20% on Strata registration with the code RADAR20.

January 31 2012

Embracing the chaos of data

A data scientist and a former Apple engineer, Pete Warden (@petewarden) is now the CTO of the new travel photography startup Jetpac. Warden will be a keynote speaker at the upcoming Strata Conference, where he'll explain why we should rethink our approach to data. Specifically, rather than pursue the perfection of structured information, Warden says we should instead embrace the chaos of unstructured data. He expands on that idea in the following interview.

What do you mean asking data scientists to embrace the chaos of data?

Pete WardenPete Warden: The heart of data science is designing instruments to turn signals from the real world into actionable information. Fighting the data providers to give you those signals in a convenient form is a losing battle, so the key to success is getting comfortable with messy requirements and chaotic inputs. As an engineer, this can feel like a deal with the devil, as you have to accept error and uncertainty in your results. But the alternative is no results at all.

Are we wasting time trying to make unstructured data structured?

Pete Warden: Structured data is always better than unstructured, when you can get it. The trouble is that you can't get it. Most structured data is the result of years of effort, so it is only available with a lot of strings, either financial or through usage restrictions.

The first advantage of unstructured data is that it's widely available because the producers don't see much value in it. The second advantage is that because there's no "structuring" work required, there's usually a lot more of it, so you get much broader coverage.

A good comparison is Yahoo's highly-structured web directory versus Google's search index built on unstructured HTML soup. If you were looking for something that was covered by Yahoo, its listing was almost always superior, but there were so many possible searches that Google's broad coverage made it more useful. For example, I hear that 30% of search queries are "once in history" events — unique combinations of terms that never occur again.

Dealing with unstructured data puts the burden on the consuming application instead of the publisher of the information, so it's harder to get started, but the potential rewards are much greater.

How do you see data tools developing over the next few years? Will they become more accessible to more people?

Pete Warden: One of the key trends is the emergence of open-source projects that deal with common patterns of unstructured input data. This is important because it allows one team to solve an unstructured-to-structured conversion problem once, and then the entire world can benefit from the same solution. For example, turning street addresses into latitude/longitude positions is a tough problem that involves a lot of fuzzy textual parsing, but open-source solutions are starting to emerge.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Associated photo on home and category pages: "mess with graphviz by Toms Bauģis, on Flickr

Related:

Older posts are this way If this message doesn't go away, click anywhere on the page to continue loading posts.
Could not load more posts
Maybe Soup is currently being updated? I'll try again automatically in a few seconds...
Just a second, loading more posts...
You've reached the end.

Don't be the product, buy the product!

Schweinderl