
November 09 2012

August 04 2011

Strata Week: Hadoop adds security to its skill set

Here are a few of the data stories that caught my eye this week.

Where big data and security collide

Could security be the next killer app for Hadoop? That's what GigaOm's Derrick Harris suggests: "The open-source, data-processing tool is already popular for search engines, social-media analysis, targeted marketing and other applications that can benefit from clusters of machines churning through unstructured data -- now it's turning its attention to security data." Noting the universality of security concerns, Harris suggests that "targeted applications" using Hadoop might be a solid starting point for mainstream businesses to adopt the technology.

Juniper Networks' Chris Hoff has also analyzed the connections between big data and security in a couple of recent posts on his Rational Survivability blog. Hoff contends that while we've had the capabilities to analyze security-related data for some time, that's traditionally happened with specialized security tools, meaning that insights are "often disconnected from the transaction and value of the asset from which they emanate."

Hoff continues:

Even when we do start to be able to integrate and correlate event, configuration, vulnerability or logging data, it's very IT-centric. It's very INFRASTRUCTURE-centric. It doesn't really include much value about the actual information in use/transit or the implication of how it's being consumed or related to.

But as both Harris and Hoff argue, Hadoop might help address this as it can handle all an organization's unstructured data and can enable security analysis that isn't "disconnected." And both Harris and Hoff point to Zettaset as an example of a company that is tackling big data and security analysis by using Hadoop.


What's your most important personal data?

Concerns about data security also occur at the personal level. To that end, The Locker Project, Singly's open source project to help people collect and control their personal data, recently surveyed people about the data they see as most important.

The survey asked people to choose from the following: contacts, messages, events, check-ins, links, photos, music, movies, or browser history. The results are in, and no surprise: photos were listed as the most important, with 37% of respondents (67 out of 179) selecting that option. Forty-six people listed their contacts, and 23 said their messages were most important.
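For anyone who wants to check the math, here is a quick sketch of those shares. It uses only the three counts reported above and assumes each of the 179 respondents picked exactly one category, which the write-up implies but does not state outright. The names `reported` and `share` are just illustrative.

```python
# Counts quoted in the survey write-up; the other six categories were not itemized.
TOTAL_RESPONDENTS = 179
reported = {"photos": 67, "contacts": 46, "messages": 23}


def share(count: int, total: int = TOTAL_RESPONDENTS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(n) for name, n in reported.items()}

# Assumption: one answer per respondent, so everything else is the remainder.
remaining = TOTAL_RESPONDENTS - sum(reported.values())

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"all other categories combined: {remaining} respondents ({share(remaining)}%)")
```

Under that one-answer assumption, photos, contacts, and messages together account for roughly three quarters of the responses, leaving about a quarter spread across the remaining six categories.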

Interestingly, browser history, events, and check-ins were rated the lowest. As Singly's Tom Longson ponders:

Do people not care about where they went? Is this data considered stale to most people, and therefore irrelevant? I personally believe I can create a lot of value from Browser History and Check-ins. For example, what websites are my friends going to that I'm not? Also, what places should I be going that I'm not? These are just a couple of ideas.

But just as revealing as the ranking of data were the reasons that people gave for why certain types were most important, as you can see in the word cloud created from their responses.

Figure: Singly word cloud from the data survey. See Singly's associated analysis of this data.

House panel moves forward on data retention law

The U.S. Congress is in recess now, but among the last-minute things it accomplished before vacation was passage by the House Judiciary Committee of "The Protecting Children from Internet Pornographers Act of 2011." Ostensibly aimed at helping track pedophiles and pornographers online, the bill has raised a number of concerns about Internet data and surveillance. If passed, the law would require, among other things, that Internet companies collect and retain the IP addresses of all users for at least one year.

Representative Zoe Lofgren was one of the opponents of the legislation in committee, trying unsuccessfully to introduce amendments that would curb its data retention requirements. She also tried to have the name of the law changed to the "Keep Every American's Digital Data for Submission to the Federal Government Without a Warrant Act of 2011."

In addition to concerns over government surveillance, TechDirt's Mike Masnick and the Cato Institute's Julian Sanchez have also pointed to the potential security issues that could arise from lengthy data retention requirements. Sanchez writes:

If I started storing big piles of gold bullion and precious gems in my home, my previously highly secure apartment would suddenly become laughably insecure, without my changing my security measures at all. If a company significantly increases the amount of sensitive or valuable information stored in its systems — because, for example, a government mandate requires them to keep more extensive logs — then the returns to a single successful intrusion (as measured by the amount of data that can be exfiltrated before the breach is detected and sealed) increase as well. The costs of data retention need to be measured not just in terms of terabytes, or man hours spent reconfiguring routers. The cost of detecting and repelling a higher volume of more sophisticated attacks has to be counted as well.

New data from a very old map

And in more pleasant "storing old data" news: the Gough Map, the oldest surviving map of Great Britain, dating back to the 14th century, has now been digitized and made available online.

The project to digitize the map, which now resides in Oxford University's Bodleian Library, took 15 months to complete. According to the Bodleian, the project explored the map's "'linguistic geographies,' that is the writing used on the map by the scribes who created it, with the aim of offering a re-interpretation of the Gough Map's origins, provenance, purpose and creation of which so little is known."

Among the insights gleaned is the revelation that the text on the Gough Map is the work of at least two different scribes: one from the 14th century and a later one, from the 15th century, who revised some pieces. Furthermore, it was also discovered that the map was made closer to 1375 than 1360, the date often given to it.

Got data news?

Feel free to email me.


November 04 2010

Strata Week: Political lessons from data land

If you live in the United States (and maybe even if you don't), you probably spent a lot of time this week thinking about elections, politicians, and the frequent shortcomings of each. Here are some lessons from the data world that our pundits would be wise to heed.

Lesson 1: Sharing is caring

If you've ever used Yelp, you're probably familiar with the "People Who Viewed this Also Viewed ..." feature (tucked in the lower-right corner of the window, below the street map). That feature, unsurprisingly, relies on the ability to process a few months' worth of access logs to analyze user behavior.

According to Dave Marin, a search and data-mining engineer at Yelp, the company generates something like 100GB of daily log data. And of course, in order to process all that data, they use distributed computing. That used to mean running MapReduce on an in-house Hadoop cluster, via a Python-package framework they call MRJob. But they found that very seldom did they make use of all the nodes ... and when they did, a large batch job would delay lots of small jobs. Le sigh.

Then -- insert sparkles here -- the team discovered Amazon's Elastic MapReduce (EMR) service. That's when they decided to migrate their entire code base over to Amazon so they could dispose of their own Hadoop cluster and rent such services on an as-needed basis. Oh, and they also decided to share MRJob with all of us, so we can help make it better (and Amazon can sell EMR services to those of us without our own Hadoop clusters?).

Writes Marin:

  • If you're new to MapReduce and want to learn about it, MRJob is for you.

  • If you want to run a huge machine learning algorithm, or do some serious log processing, but don't want to set up a Hadoop cluster, MRJob is for you.

  • If you have a Hadoop cluster already and want to run Python scripts on it, MRJob is for you.

  • If you want to migrate your Python code base off your Hadoop cluster to EMR, MRJob is for you.

  • (If you don't want to write Python, MRJob is not so much for you. But we can fix that.)

To learn more or download the code, check out the GitHub page.
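To make the "Also Viewed" idea a bit more concrete, here is a rough sketch of how co-view counts could be computed in two MapReduce-style steps. This is not Yelp's code and it deliberately avoids the MRJob API: `run_step` is a tiny in-memory stand-in for a MapReduce step, and the log records, function names, and ranking rule are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, heavily simplified access-log records: (user_id, business_id).
LOG = [
    ("u1", "cafe"), ("u1", "bakery"), ("u1", "diner"),
    ("u2", "cafe"), ("u2", "bakery"),
    ("u3", "cafe"), ("u3", "diner"),
    ("u4", "bakery"), ("u4", "cafe"),
]


def map_by_user(record):
    """Step 1 mapper: key each page view by the user who made it."""
    user, business = record
    yield user, business


def reduce_to_pairs(user, businesses):
    """Step 1 reducer: emit every pair of distinct businesses one user viewed."""
    for a, b in combinations(sorted(set(businesses)), 2):
        yield (a, b), 1


def identity(record):
    """Step 2 mapper: pass (pair, 1) records through unchanged."""
    yield record


def reduce_count(pair, ones):
    """Step 2 reducer: count the users who viewed both businesses in the pair."""
    yield pair, sum(ones)


def run_step(records, mapper, reducer):
    """In-memory stand-in for one MapReduce step: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out


def also_viewed(business, counts, top_n=3):
    """Rank the businesses most often viewed by the same users as `business`."""
    related = []
    for (a, b), n in counts.items():
        if business == a:
            related.append((b, n))
        elif business == b:
            related.append((a, n))
    # Highest co-view count first; ties broken alphabetically.
    return sorted(related, key=lambda item: (-item[1], item[0]))[:top_n]


pairs = run_step(LOG, map_by_user, reduce_to_pairs)
co_views = dict(run_step(pairs, identity, reduce_count))
print(also_viewed("cafe", co_views))
```

On a real cluster the grouping between map and reduce is what the framework does for you across many machines; the sketch only shows the shape of the computation, which is why a few months of logs at 100GB a day calls for something like Hadoop or EMR.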

Lesson 2: History matters

The Committee on Data for Science and Technology (CODATA) launched an initiative on Oct. 29 to make a global inventory of "threatened data." That includes myriad kinds of analog data, as well as digitized data that exists on older, degradable formats, such as floppy disks or magnetic tape.

One purpose of the inventory is to preserve old records and sources of data. But another purpose is to help researchers and preservationists prioritize what to save. It may not be possible to keep everything, but having an inventory will help us know where to focus our energies. As Nature News reports:

Climate-change studies, for example, require data series on temperature and rainfall reaching back further than digital records. Some scientists are having to leaf through old ships' logs for clues to past weather patterns.

Politicians, I realize, sometimes prefer old documents to stay buried. But cataloging and saving information seems like a pretty good plan to me.

Lesson 3: Forget expensive suits

David McCandless' elegant (and helpful!) chart of Who's Suing Whom in the Telecoms Trade got an update from Fortune to include Apple's newest lawsuit against Motorola.

As elegant as this graphic is, only one word comes to mind: ugh.

Lesson 4: Confusion can cost you

Nowhere is the power of plain speech more quantifiably evident this week than in Expedia's discovery that a confusing data field had been costing them $12 million a year.

Expedia analysts were trying to figure out why a number of people who had entered information like dates and credit card numbers, and then clicked the "Buy Now" button, never completed their transactions. They began correlating information about these events to discover patterns in the transaction failures.

Lo and behold, it turned out that the transactions were being rejected during credit card verification because customers were entering incorrect addresses. And why were they entering incorrect addresses, you ask?

"We had an optional field on the site under 'Name', which was 'Company'," said Joe Megibow, Expedia's VP of global analytics. "It confused some customers who filled out the 'Company' field with their bank name."

These customers then went on to enter the address of their bank in the address fields.

The fix, of course, was simply to delete the confusing field. According to Megibow, this caused a major change "overnight," leading to an additional $12 million annual profit. And he says they have identified 50-60 more such changes by using analytics.
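The kind of correlation the analysts were chasing can be sketched in a few lines: split checkout attempts by whether the optional field was filled in, then compare failure rates. The records below are invented toy data, not Expedia's numbers, and `failure_rate_by_flag` is a hypothetical helper.

```python
from collections import Counter

# Made-up checkout attempts, for illustration only:
# (filled_company_field, transaction_completed)
attempts = [
    (True, False), (True, False), (True, False), (True, True),
    (False, True), (False, True), (False, True), (False, True),
    (False, True), (False, False),
]


def failure_rate_by_flag(rows):
    """Return {flag: fraction of attempts that failed} for each flag value."""
    totals, failures = Counter(), Counter()
    for flag, completed in rows:
        totals[flag] += 1
        if not completed:
            failures[flag] += 1
    return {flag: failures[flag] / totals[flag] for flag in totals}


rates = failure_rate_by_flag(attempts)
print(f"failure rate, 'Company' filled: {rates[True]:.0%}")
print(f"failure rate, 'Company' empty:  {rates[False]:.0%}")
```

A gap like the one in this toy data (three of four attempts failing when the field is filled, versus one of six when it is empty) only flags a suspect. It took the follow-up finding about bank addresses to explain why the field was causing failures.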

Score one more for simplicity. For as every politician surely knows, the devil is in the details.

Send us news

Email us news, tips and interesting tidbits.
