
September 13 2013

Four short links: 13 September 2013

  1. Fog Creek’s Remote Work Policy — In the absence of new information, the assumption is that you’re producing. When you step outside the HQ work environment, you should flip that burden of proof. The burden is on you to show that you’re being productive. Is that because we don’t trust you? No. It’s because a few normal ways of staying involved (face time, informal chats, lunch) have been removed.
  2. Coder (GitHub) — a free, open source project that turns a Raspberry Pi into a simple platform that educators and parents can use to teach the basics of building for the web. New coders can craft small projects in HTML, CSS, and Javascript, right from the web browser.
  3. MillWheel (PDF) — a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees. From Google Research.
  4. Probabilistic Scraping of Plain Text Tables — the method leverages topological understanding of tables, encodes it declaratively into a mixed integer/linear program, and integrates weak probabilistic signals to classify the whole table in one go (at sub second speeds). This method can be used for any kind of classification where you have strong logical constraints but noisy data.

May 11 2011

Scraping, cleaning, and selling big data

In 2008, the Austin-based data startup Infochimps released a scrape of Twitter data that was later taken down at the request of the microblogging site because of user privacy concerns. Infochimps has since struck a deal with Twitter to make some datasets available on the site, and the Infochimps marketplace now contains more than 10,000 datasets from a variety of sources. Not all these datasets have been obtained via scraping, but nevertheless, the company's process of scraping, cleaning, and selling big data is an interesting topic to explore, both technically and legally.

With that in mind, Infochimps CEO Nick Ducoff, CTO Flip Kromer, and business development manager Dick Hall explain the business of data scraping in the following interview.

What are the legal implications of data scraping?

Dick Hall: There are three main areas you need to consider: copyright, terms of service, and "trespass to chattels."

United States copyright law protects against unauthorized copying of "original works of authorship." Facts and ideas are not copyrightable. However, expressions or arrangements of facts may be copyrightable. For example, a recipe for dinner is not copyrightable, but a recipe book with a series of recipes selected based on a unifying theme would be copyrightable. This example illustrates the "originality" requirement for copyright.

Let's apply this to a concrete web-scraping example. The New York Times publishes a blog post that includes the results of an election poll arranged in descending order by percentage. The New York Times can claim a copyright on the blog post, but not the table of poll results. A web scraper is free to copy the data contained in the table without fear of copyright infringement. However, in order to make a copy of the blog post wholesale, the web scraper would have to rely on a defense to infringement, such as fair use. The result is that it is difficult to maintain a copyright over data, because only a specific arrangement or selection of the data will be protected.

Most websites include a page outlining their terms of service (ToS), which defines the acceptable use of the website. For example, YouTube forbids a user from posting copyrighted materials if the user does not own the copyright. Terms of service are based in contract law, but their enforceability is a gray area in US law. A web scraper violating the letter of a site's ToS may argue that they never explicitly saw or agreed to the terms of service.

Assuming ToS are enforceable, they are a risky issue for web scrapers. First, every site on the Internet will have a different ToS — Twitter, Facebook, and The New York Times may all have drastically different ideas of what is acceptable use. Second, a site may unilaterally change the ToS without notice and maintain that continued use represents acceptance of the new ToS by a web scraper or user. For example, Twitter recently changed its ToS to make it significantly more difficult for outside organizations to store or export tweets for any reason.

There's also the issue of volume. High-volume web scraping could cause significant monetary damages to the sites being scraped. For example, if a web scraper checks a site for changes several thousand times per second, it is functionally equivalent to a denial of service attack. In this case, the web scraper may be liable for damages under a theory of "trespass to chattels," because the site owner has a property interest in his or her web servers. A good-natured web scraper should be able to avoid this issue by picking a reasonable frequency for scraping.
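One way to picture "a reasonable frequency for scraping" is a limiter that enforces a minimum gap between requests to the same site. The sketch below is only an illustration: the class name `RateLimiter`, the ten-second interval in the usage note, and the injectable `clock` and `sleep` parameters are assumptions made for this example, not anything the interviewees describe, and the right interval depends on the site being scraped.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum delay between successive requests to one site."""

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last_request: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_request is not None:
            elapsed = now - self._last_request
            delay = max(0.0, self.min_interval_s - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record when the request is actually allowed to go out.
        self._last_request = now + delay
        return delay
```

A scraper would call `limiter.wait()` immediately before each fetch, for example with `limiter = RateLimiter(10.0)` to poll no more often than once every ten seconds.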


What are some of the challenges of acquiring data through scraping?

Flip Kromer: There are several problems with the scale and the metadata, as well as historical complications.

  • Scale — It's obvious that terabytes of data will cause problems, but so (on most filesystems) will having tens of millions of files in the same directory tree.
  • Metadata — It's a chicken-and-egg problem. Since few programs can draw on rich metadata, it's not much use annotating it. But since so few datasets are annotated, it's not worth writing support into your applications. We have an internal data-description language that we plan to open source as it matures.
  • Historical complications — Statisticians like SPSS files. Semantic web advocates like RDF/XML. Wall Street quants like Mathematica exports. There is no One True Format. Lifting each out of its source domain is time consuming.
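The scale point about tens of millions of files in one directory tree is commonly handled by spreading files across nested subdirectories keyed on a hash of the file's identifier. The following is a minimal sketch of that general technique, not a description of how Infochimps stores its data; the function name `sharded_path`, the root directory, and the choice of SHA-1 with two levels of two hex characters are all assumptions for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(root: str, key: str, levels: int = 2, width: int = 2) -> PurePosixPath:
    """Map a key (for example a scraped URL) to a path under nested hash buckets.

    With the defaults, the first two pairs of hex digits of the SHA-1 digest
    become directory names, giving 256 * 256 = 65,536 leaf directories.
    """
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Slice the leading hex digits into one directory name per level.
    buckets = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *buckets, digest)


# Hypothetical usage: the same key always lands in the same bucket.
path = sharded_path("/data/scrape", "http://example.com/page/1")
```

Under these assumptions, 50 million files spread evenly over 65,536 leaf directories come to roughly 760 files per directory, which most filesystems handle comfortably.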

But the biggest non-obvious problem we see is source domain complexity. This is what we call the "uber" problem. A developer wants the answer to a reasonable question, such as "What was the air temperature in Austin at noon on August 6, 1998?" The obvious answer — "damn hot" — isn't acceptable. Neither is:

Well, it's complicated. See, there are multiple weather stations, all reporting temperatures — each with its own error estimate — at different times. So you simply have to take the spatial- and time-average of their reported values across the region. And by the way, did you mean Austin's city boundary, or its metropolitan area, or its downtown region?

There are more than a dozen incompatible yet fundamentally correct ways to measure time: Earth-centered? Leap seconds? Calendrical? Does the length of a day change as the earth's rotational speed does?

Data at "everything" scale is sourced by domain experts, who necessarily live at the "it's complicated" level. To make it useful to the rest of the world requires domain knowledge, and often a transformation that is simply nonsensical within the source domain.
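To make the weather-station example concrete, here is a small sketch of one standard way to combine several readings that each carry their own error estimate: an inverse-variance weighted mean. It is an assumption that this is a suitable model. It treats the station errors as independent and ignores the spatial and time-of-day questions raised above, and the station names and temperatures in the usage note are made up.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Reading:
    station: str     # hypothetical station identifier
    temp_c: float    # reported air temperature in degrees Celsius
    sigma_c: float   # one-standard-deviation error estimate, same units


def combine(readings: Sequence[Reading]) -> Tuple[float, float]:
    """Return (weighted mean, standard error) using inverse-variance weights.

    Readings with smaller error estimates count for more. Assumes the
    errors are independent and every sigma_c is greater than zero.
    """
    if not readings:
        raise ValueError("need at least one reading")
    weights = [1.0 / (r.sigma_c ** 2) for r in readings]
    total = sum(weights)
    mean = sum(w * r.temp_c for w, r in zip(weights, readings)) / total
    return mean, math.sqrt(1.0 / total)
```

For instance, `combine([Reading("A", 38.0, 1.0), Reading("B", 40.0, 2.0)])` gives a mean pulled toward station A, because its error estimate is smaller. Even this simplified version requires choices (which stations, which time window, which boundary) that the developer asking for a single number never sees.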

How will data marketplaces change the work and direction of data startups?

Nick Ducoff: I vividly remember being taught about comparative advantage. This might age me a bit, but the lesson was: Michael Jordan doesn't mow his own lawn. Why? Because he should spend his time practicing basketball since that's what he's best at and makes a lot of money doing. The same analogy applies to software developers. If you are best at the presentation layer, you don't want to spend your time futzing around with databases.

Infochimps allows these developers to spend their time doing what they do best — building apps — while we spend ours doing what we do best — making data easy to find and use. What we're seeing is startups focusing on pieces of the stack. Over time the big cloud providers will buy these companies to integrate into their stacks.

Companies like Heroku (acquired by Salesforce) and CloudKick (acquired by Rackspace) have paved the way for this. Tools like ScraperWiki and Junar will allow anybody to pull down tables off the web, and companies like Mashery, Apigee and 3scale will continue to make APIs more prevalent. We'll help make these tables and APIs findable and usable. Developers will be able to go from idea to app in hours, not days or weeks.

This interview was edited and condensed.





January 25 2011

Four short links: 25 January 2011

  1. node.io -- distributed node.js-based scraper system.
  2. Joystick-It -- adhesive joystick for the iPad. Compare the Fling analogue joystick. Tactile accessories for the iPad—hot new product category or futile attempt to make a stripped-down demi-computer into an aftermarket pimped-out hackomatic? (via Aza Raskin on Twitter)
  3. Programmed for Love (Chronicle of Higher Education) -- Sherry Turkle sees the danger in social hardware emulating emotion. Companies will soon sell robots designed to baby-sit children, replace workers in nursing homes, and serve as companions for people with disabilities. All of which to Turkle is demeaning, "transgressive," and damaging to our collective sense of humanity. It's not that she's against robots as helpers—building cars, vacuuming floors, and helping to bathe the sick are one thing. She's concerned about robots that want to be buddies, implicitly promising an emotional connection they can never deliver. (via BoingBoing)
  4. Asking the Right Questions (Expert Labs) -- Andy Baio compiled a list of how Q&A sites like StackOverflow, Quora, Yahoo! Answers, etc. steer people towards asking questions whose answers will improve the site (and away from flamage, chitchat, etc.). The secret sauce to social software is the invisible walls that steer people towards productive behaviour.
