
April 04 2011

The truth about data: Once it's out there, it's hard to control

The amount of data being produced is increasing exponentially, which raises big questions about security and ownership. Do we need to be more concerned about the information many of us readily give out to join popular social networks, sign up for website community memberships, or subscribe to free online email? And what happens to that data once it's out there?

In a recent interview, Jeff Jonas (@JeffJonas), IBM distinguished engineer and a speaker at the O'Reilly Strata Online Conference, said consumers' willingness to give away their data is a concern, but it's perhaps secondary to the sheer number of data copies produced.

Our interview follows.

What is the current state of data security?

Jeff Jonas: A lot of data has been created, and a boatload more is on its way; we have seen nothing yet. Organizations now wonder how they are going to protect all this data, especially how to protect it from unintended disclosure. Healthcare providers, for example, are just as determined to prevent a "wicked leak" as anyone else. Just imagine the conversation between the CIO and the board trying to explain the risk of the enemy within (the "insider threat") and the endless and ever-changing attack vectors.

I'm thinking a lot these days about data protection, ranging from reducing the number of copies of data to data anonymization to perpetual insider threat detection.

How are advancements in data gathering, analysis, and application affecting privacy, and should we be concerned?

Jeff Jonas: When organizations only collect what they need in order to conduct business, tell the consumer what they are collecting, why and how they are going to use it, and then use it this way, most would say "fair game." This is all in line with Fair Information Practices (FIPs).

There continues to be some progress in the area of privacy-enhancing technology. One example is tamper-resistant audit logs, a way to record how a system was used that even the database administrator cannot alter. On the other hand, the trend that I see involves the willingness of consumers to give up all kinds of personal data in return for some benefit: free email or a fantastic social network site, for example.
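One common way to approximate the audit-log idea mentioned above is a hash chain, in which every entry stores a hash of the entry before it. The sketch below is an illustration only, not a description of any IBM product; the names `append_entry`, `verify`, and `GENESIS`, and the sample records, are assumptions made for this example. Strictly speaking, it makes tampering detectable rather than impossible.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical placeholder hash that precedes the first entry


def _digest(prev_hash, record):
    """Hash the record together with the hash of the previous entry."""
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, record):
    """Append a record, chaining it to the last entry in the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)})


def verify(log):
    """Return True only if every entry still matches its recomputed hash and its predecessor."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative records only
audit_log = []
append_entry(audit_log, {"user": "dba", "action": "SELECT", "table": "patients"})
append_entry(audit_log, {"user": "analyst", "action": "EXPORT", "table": "claims"})
```

Editing any stored record changes its recomputed hash, so `verify` returns `False`. An administrator who can rewrite the whole log could also rebuild the chain, which is why real deployments typically keep a copy of the latest hash somewhere the administrator cannot reach.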

While it is hard to not be concerned about what is happening to our privacy, I have to admit that for the most part technology advances are really delivering a lot of benefit to mankind.

The Strata Online Conference, being held April 6, will look at how information — and the ability to put it to work — will shape tomorrow's markets. Scheduled speakers include: Gavin Starks from AMEE, Jeff Jonas from IBM, Chris Thorpe from Artfinder, and Ian White from Urban Mapping.


What are the major issues surrounding data ownership?

Jeff Jonas: If users continue to give their data away because the benefits are irresistible, then there will be fewer battles, I suppose. The truth about data is that once it is out there, it's hard to control.

I did a back-of-the-envelope calculation a few years ago to estimate the number of copies a single piece of data may go through. Turns out the number is roughly the same as the number of licks it takes to get to the center of a Tootsie Pop, a play on an old TV commercial that basically translates to more than you can easily count.

A well-thought-out data backup strategy alone may create more than 100 copies. Then what about the operational data stores, data warehouses, data marts, secondary systems and their backups? Thousands of copies would not be uncommon. Even if a consumer thought they could own their data — which they can't in many settings — how could they ever do anything to affect it?


March 09 2011

One foot in college, one foot in business

In a recent interview, Joe Hellerstein, a professor in the UC Berkeley computer science department, talked about the disconnect between open source innovation and development. The problem, he said, doesn't lie with funding, but with engineering and professional development:

As I was coming up as a student, really interesting open source was coming out of universities. I'm thinking of things like the Ingres and Postgres database projects at Berkeley and the Mach operating system at Carnegie Mellon. These are things that today are parts of commercial products, but they began as blue-sky research. What has changed now is there's more professionally done open source. It's professional, but it's further disconnected from research.

A lot of the open source that's very important is really "me-too" software — so Linux was a clone of Unix, and Hadoop is a clone of Google's MapReduce. There's a bit of a disconnect between the innovation side, which the universities are good at, and the professionalism of open source that we expect today, which the companies are good at. The question is, can we put those back together through some sort of industrial-academic partnership? I'm hopeful that can be done, but we need to change our way of business.

Hellerstein pointed to the MADlib project being conducted between his group at Berkeley and the project sponsor EMC Greenplum as an example of a new partnership model that could close the gap between innovation and development.

Our sponsor would have been happy to donate money to my research funds, but I said, "You know, what I really need is engineering time."

The thing I cannot do on campus is run a professional engineering shop. There are no career incentives for people to be programmers at the university. But a company has processes and expertise, and they can hire really good people who have a career path in the company. Can we find an arrangement where those people are working on open source code in collaboration with the people at the university?

It's a different way of doing research funding. The company's contributions are not financial. The contributions are in engineering sweat. It's an interesting experiment, and it's going well so far.

In the interview Hellerstein also discusses MAD data analysis and where we are in the industrial revolution of data. The full interview is available in the following video:



January 27 2011

The "dying craft" of data on discs

To prepare for next week's Strata Conference, we're continuing our series of conversations with innovators working with big data and analytics. Today, we hear from Ian White, the CEO of Urban Mapping.

Mapfluence, one of Urban Mapping's products, is a spatial database platform that aggregates data from multiple sources to deliver geographic insights to clients. GIS services online are not a completely new idea, but White said the leading players haven't "risen to the occasion." That's left open some new opportunities, particularly at the lower end of the market. Whereas traditional GIS services still often deliver data by mailing out a CD-ROM or through proprietary client-server systems, Urban Mapping is one of several companies that have updated the model to work through the browser. Its key selling point, White said, is a wider range of licensing levels that allows it to support smaller clients as well as larger ones.

Geographic data is increasingly free, but the value proposition for companies like Urban Mapping lies in the intelligence behind the data, and the organization that makes it accessible. "We're in a phase now where we're aggregating a lot of high-value data," White said. "The next phase is to offer tools to editorially say what you want."

Urban Mapping aims to provide the domain expertise on the demographic datasets it works with, freeing clients up to focus on the intelligence revealed by the data. "A developer might spend a lot of time looking through a data catalog to find a column name. If, for example, the developer is making an application for commercial real estate and they want demographic information, they might wonder which one of 1,500 different indicators they want." Delivering the right one is obviously of a higher value than delivering a list of all 1,500. "That saves an enormous amount of time."

To achieve those time savings, Urban Mapping considers the end users and their needs when it sources data. As the company designs the architecture around that data, it thinks about three layers: the design layer, the application layer, and the user interface layer atop that. "We look to understand the user's ultimate purpose and then work back from there," White said, describing how the team organizes tables, adds metadata, and makes sure data is efficiently accessible to technical and non-technical users.

"The notion of receiving a CD in the mail, opening it, reading the manual, it's kind of a dying craft," White said. "It's unfortunate that a lot of companies have built processes around having people on staff to do this kind of work. We can effectively allow those people to work in a higher-value area of the business."

You'll find the full interview in the following video:

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will focus on the business and practice of data. The conference will provide three days of training, breakout sessions, and plenary discussions -- along with an Executive Summit, a Sponsor Pavilion, and other events showcasing the new data ecosystem.

Save 30% on registration with the code STR11RAD


January 11 2011

Backtype: Using big data to make sense of social media

To prepare for O'Reilly's upcoming Strata Conference, we're talking with some of the leading innovators working with big data and analytics. Today, we talk with Backtype's lead engineer, Nathan Marz.

Backtype is an "intelligence platform," a suite of tools and insights that help companies quantify and understand the impact of their social media efforts. Marz works on the back end, figuring out ways to store and process terabytes of data from Twitter, Facebook, YouTube, and millions of blogs.

The platform runs on Hadoop, and makes use of Cascading, a Java API for creating complex workflows for processing data. Marz likes working with the Java-based tool for abstracting details of Hadoop because, "I find that when you're using a custom language you end up having a lot of complexity in your program that you don't anticipate, especially when you try to do things that are more dynamic."

Big data tools and applications will be examined at the Strata Conference (Feb. 1-3, 2011). Save 30% on registration with the code STR11RAD.

Marz has written an abstraction on top of Cascading called Cascalog, a Clojure-based query language for Hadoop inspired by Datalog. "The cool thing about Clojure is that it fully integrates with the Java programming language," Marz said. "I think one of the problems with Lisps in the past has been a lack of library support. But by being on top of the JVM, that problem is solved with Clojure." He's generally optimistic about what the functional and declarative paradigms can offer in the big data space, saying his programs are more concise and written closer to how he thinks.

Marz says he's happy with the development activity around Cascalog since he released it in April 2010. He is working on a few enhancements: making it more expressive by adding optimized joins, and making the query planner more intelligent by, for example, applying push-down filtering more aggressively.
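To give a rough sense of what push-down filtering buys a query planner, here is a small Python sketch. It is not Cascalog code and does not reflect how Cascalog's planner is implemented; the `hash_join` helper, the `popular` predicate, and the sample `posts` and `profiles` data are all hypothetical. It compares joining first and filtering afterwards with filtering first and joining only the surviving rows.

```python
def hash_join(left, right, key):
    """Join two lists of dicts on `key`.

    Returns (joined_rows, examined), where `examined` counts the left-side
    rows the join had to process.
    """
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined, examined = [], 0
    for row in left:
        examined += 1
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined, examined


def popular(row):
    """Hypothetical filter: keep only posts with at least 90 likes."""
    return row["likes"] >= 90


# Illustrative data only
posts = [{"author": f"user{i % 5}", "likes": i} for i in range(100)]
profiles = [{"author": f"user{i}", "country": "US" if i % 2 == 0 else "DE"} for i in range(5)]

# Naive plan: join everything, then filter the joined rows.
joined, naive_examined = hash_join(posts, profiles, "author")
naive_result = [row for row in joined if popular(row)]

# Pushed-down plan: filter the posts first, then join what is left.
filtered_posts = [row for row in posts if popular(row)]
pushed_result, pushed_examined = hash_join(filtered_posts, profiles, "author")
```

Both plans produce the same rows, but the pushed-down plan hands the join far fewer rows to process. On a cluster, that difference would show up as less data moved between machines.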

You'll find the full interview in the following video:

January 06 2011

Big data faster: A conversation with Bradford Stephens

To prepare for O'Reilly's upcoming Strata Conference, we're continuing our series of conversations with some of the leading innovators working with big data and analytics. Today, we hear from Bradford Stephens, founder of Drawn to Scale.

Drawn to Scale is a database platform that works with large data sets. Stephens describes its focus as slightly different from that of other big data tools: "Other tools out there concentrate on doing complex things with your data in seconds to minutes. We really concentrate on doing simple things with your data in milliseconds."

Stephens calls such speed "user time" and he credits Drawn to Scale's performance to its indexing system working in parallel with backend batch tools. Like other big data tools, Drawn to Scale uses MapReduce and Hadoop for batch processing on the back end. But on the front end, a series of secondary indices on top of the storage layer speed up retrieval. "We find that when you index data in the manner in which you wish to use it, it's basically one single call to the disk to access it," Stephens says. "So it can be extremely fast."
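The idea of indexing data "in the manner in which you wish to use it" can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Drawn to Scale's design: an in-memory dictionary stands in for an on-disk secondary index, and the `events` records and the `scan`, `build_index`, and `lookup` helpers are hypothetical.

```python
from collections import defaultdict

# Illustrative records; the list stands in for the storage layer
events = [
    {"id": 1, "user": "alice", "kind": "click"},
    {"id": 2, "user": "bob", "kind": "view"},
    {"id": 3, "user": "alice", "kind": "view"},
    {"id": 4, "user": "carol", "kind": "click"},
]


def scan(records, field, value):
    """Without an index: examine every record to find the matches."""
    return [record for record in records if record[field] == value]


def build_index(records, field):
    """Batch step: map each value of `field` to the positions of matching records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[field]].append(position)
    return index


def lookup(records, index, value):
    """Query step: one index probe, then fetch only the matching records."""
    return [records[position] for position in index.get(value, [])]


by_user = build_index(events, "user")
alice_events = lookup(events, by_user, "alice")
```

The index costs extra work and storage up front, which fits the split Stephens describes: batch tools do the heavy lifting in the background, and queries at "user time" only have to follow the index to the records they need.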


Drawn to Scale's customers include organizations working with analytics, in social media, in mobile ad targeting and delivery, and also organizations with large arrays of sensor networks. While Stephens expects to see some consolidation on the commercial side ("I see a lot of vendors out there doing similar things"), on the open source side he expects to see a proliferation of tools available in areas such as geo data and managing time series. "People have some very specific requirements that they're going to cook up in open source."

You'll find the full interview in the following video:
