
February 26 2014

Four short links: 26 February 2014

  1. Librarybox 2.0 -- fork of PirateBox for the TP-Link MR 3020, customized for educational, library, and other needs. Wifi hotspot with free and anonymous file sharing. v2 adds mesh networking and more. (via BoingBoing)
  2. Chicago PD’s Using Big Data to Justify Racial Profiling (Cory Doctorow) — The CPD refuses to share the names of the people on its secret watchlist, nor will it disclose the algorithm that put it there. [...] Asserting that you’re doing science but you can’t explain how you’re doing it is a nonsense on its face. Spot on.
  3. Cloudwash (BERG) — very good mockup of how and why your washing machine might be connected to the net and bound to your mobile phone. No face on it, though. They’re losing their touch.
  4. What’s Left of Nokia to Bet on Internet of Things (MIT Technology Review) — With the devices division gone, the Advanced Technologies business will cut licensing deals and perform advanced R&D with partners, with around 600 people around the globe, mainly in Silicon Valley and Finland. Hopefully will not devolve into being a patent troll. [...] “We are now talking about the idea of a programmable world. [...] If you believe in such a vision, as I do, then a lot of our technological assets will help in the future evolution of this world: global connectivity, our expertise in radio connectivity, materials, imaging and sensing technologies.”

January 27 2014

Four short links: 27 January 2014

  1. Druid — open source clustered data store (not key-value store) for real-time exploratory analytics on large datasets.
  2. It’s Time to Engineer Some Filter Failure (Jon Udell) — Our filters have become so successful that we fail to notice: We don’t control them, They have agendas, and They distort our connections to people and ideas. That idea that algorithms have agendas is worth emphasising. Reality doesn’t have an agenda, but the deployer of a similarity metric has decided what features to look for, what metric they’re optimising, and what to do with the similarity data. These are all choices with an agenda.
  3. Capstone — open source multi-architecture disassembly engine.
  4. The Future of Employment (PDF) — We note that this prediction implies a truncation in the current trend towards labour market polarization, with growing employment in high and low-wage occupations, accompanied by a hollowing-out of middle-income jobs. Rather than reducing the demand for middle-income occupations, which has been the pattern over the past decades, our model predicts that computerisation will mainly substitute for low-skill and low-wage jobs in the near future. By contrast, high-skill and high-wage occupations are the least susceptible to computer capital. (via The Atlantic)

December 16 2013

Four short links: 16 December 2013

  1. Suro (Github) — Netflix data pipeline service for large volumes of event data. (via Ben Lorica)
  2. NIPS Workshop on Data Driven Education — lots of research papers around machine learning, MOOC data, etc.
  3. Proofist — crowdsourced proofreading game.
  4. 3D-Printed Shoes (YouTube) -- LeWeb talk from the founder of the company (Continuum Fashion). (via Brady Forrest)

December 03 2013

Four short links: 3 December 2013

  1. SAMOA — Yahoo!’s distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. (via Introducing SAMOA)
  2. madlib -- an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
  3. Data Portraits: Connecting People of Opposing Views — Yahoo! Labs research to break the filter bubble. Connect people who disagree on issue X (e.g., abortion) but who agree on issue Y (e.g., Latin American interventionism), and present the differences and similarities visually (they used wordclouds). Our results suggest that organic visualisation may revert the negative effects of providing potentially sensitive content. (via MIT Technology Review)
  4. Disguise Detection — using Raspberry Pi, Arduino, and Python.

June 07 2013

Four short links: 7 June 2013

  1. Accumulo — NSA’s BigTable implementation, released as an Apache project.
  2. How the Robots Lost (Business Week) — the decline of high-frequency trading profits (basically, markets worked and imbalances in speed and knowledge have been corrected). Notable for the regulators getting access to the technology that the traders had: Last fall the SEC said it would pay Tradeworx, a high-frequency trading firm, $2.5 million to use its data collection system as the basic platform for a new surveillance operation. Code-named Midas (Market Information Data Analytics System), it scours the market for data from all 13 public exchanges. Midas went live in February. The SEC can now detect anomalous situations in the market, such as a trader spamming an exchange with thousands of fake orders, before they show up on blogs like Nanex and ZeroHedge. If Midas sees something odd, Berman’s team can look at trading data on a deeper level, millisecond by millisecond.
  3. PRISM: Surprised? (Danny O’Brien) — I really don’t agree with the people who think “We don’t have the collective will”, as though there’s some magical way things got done in the past when everyone was in accord and surprised all the time. It’s always hard work to change the world. Endless, dull hard work. Ten years later, when you’ve freed the slaves or beat the Nazis everyone is like “WHY CAN’T IT BE AS EASY TO CHANGE THIS AS THAT WAS, BACK IN THE GOOD OLD DAYS. I GUESS WE’RE ALL JUST SHEEPLE THESE DAYS.”
  4. What We Don’t Know About Spying on Citizens is Scarier Than What We Do Know (Bruce Schneier) — The U.S. government is on a secrecy binge. It overclassifies more information than ever. And we learn, again and again, that our government regularly classifies things not because they need to be secret, but because their release would be embarrassing. Open source BigTable implementation: free. Data gathering operation around it: $20M/year. Irony in having the extent of authoritarian Big Brother government secrecy questioned just as a whistleblower’s military trial is held “off the record”: priceless.

April 01 2013

Four short links: 1 April 2013

  1. MLDemos -- an open-source visualization tool for machine learning algorithms created to help studying and understanding how several algorithms function and how their parameters affect and modify the results in problems of classification, regression, clustering, dimensionality reduction, dynamical systems and reward maximization. (via Mark Alen)
  2. kiln (GitHub) — open source extensible on-device debugging framework for iOS apps.
  3. Industrial Internet — the O’Reilly report on the industrial Internet of things is out. Prasad suggests an illustration: for every car with a rain sensor today, there are more than 10 that don’t have one. Instead of an optical sensor that turns on windshield wipers when it sees water, imagine the human in the car as a sensor — probably somewhat more discerning than the optical sensor in knowing what wiper setting is appropriate. A car could broadcast its wiper setting, along with its location, to the cloud. “Now you’ve got what you might call a rain API — two machines talking, mediated by a human being,” says Prasad. It could alert other cars to the presence of rain, perhaps switching on headlights automatically or changing the assumptions that nearby cars make about road traction.
  4. Unique in the Crowd: The Privacy Bounds of Human Mobility (PDF, Nature) — We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier’s antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. We coarsen the data spatially and temporally to find a formula for the uniqueness of human mobility traces given their resolution and the available outside information. This formula shows that the uniqueness of mobility traces decays approximately as the 1/10 power of their resolution. Hence, even coarse datasets provide little anonymity. These findings represent fundamental constraints to an individual’s privacy and have important implications for the design of frameworks and institutions dedicated to protect the privacy of individuals. As Edd observed, “You are a unique snowflake, after all.” (via Alasdair Allan)
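
The uniqueness claim in item 4 can be made concrete with a toy experiment. What follows is a minimal sketch on synthetic data, not the paper's code or dataset: the helper names (`make_traces`, `unique_fraction`, `coarsen`), the uniformly random traces, and the parameter values are all assumptions chosen for illustration. It counts how many people are the only match for k randomly chosen (hour, antenna) points from their own trace, then repeats the measurement after coarsening the spatial resolution.

```python
import random


def make_traces(n_people, n_hours, n_antennas, seed=0):
    """Synthetic traces (hypothetical data): one random antenna id per person per hour."""
    rng = random.Random(seed)
    return [[rng.randrange(n_antennas) for _ in range(n_hours)]
            for _ in range(n_people)]


def unique_fraction(traces, k, seed=0):
    """Fraction of people who are the only match for k points drawn from their own trace."""
    rng = random.Random(seed)
    n_hours = len(traces[0])
    unique = 0
    for trace in traces:
        # Pick k distinct hours; the "outside information" is where this person was then.
        hours = rng.sample(range(n_hours), k)
        matches = sum(all(other[h] == trace[h] for h in hours) for other in traces)
        if matches == 1:  # only the person themselves fits the known points
            unique += 1
    return unique / len(traces)


def coarsen(traces, factor):
    """Lower the spatial resolution by merging `factor` adjacent antenna ids into one cell."""
    return [[antenna // factor for antenna in trace] for trace in traces]


if __name__ == "__main__":
    traces = make_traces(n_people=500, n_hours=48, n_antennas=20)
    for k in (1, 2, 4):
        print("points known:", k, "unique fraction:", unique_fraction(traces, k))
    print("coarsened x4, 4 points:", unique_fraction(coarsen(traces, 4), 4))
```

Real mobility traces are far from uniformly random, so the numbers this prints say nothing about the paper's 95% figure; the sketch only shows the shape of the measurement and how coarsening the data enters it.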

March 18 2013

Four short links: 18 March 2013

  1. A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method (PDF) — This project was simultaneously an experiment in developing quantitative and computational methods for tracing changes in literary language. We wanted to see how far quantifiable features such as word usage could be pushed toward the investigation of literary history. Could we leverage quantitative methods in ways that respect the nuance and complexity we value in the humanities? To this end, we present a second set of results, the techniques and methodological lessons gained in the course of designing and running this project. Even litcrit becoming a data game.
  2. Easy6502 -- get started writing 6502 assembly language. Fun way to get started with low-level coding.
  3. How Analytics Really Work at a Small Startup (Pete Warden) — The key for us is that we’re using the information we get primarily for decision-making (should we build out feature X?) rather than optimization (how can we improve feature X?). Nice rundown of tools and systems he uses, with plug for KissMetrics.
  4. webgl-heatmap (GitHub) — a JavaScript library for high performance heatmap display.

February 14 2013

Four short links: 14 February 2013

  1. Welcome to the Malware-Industrial Complex (MIT) — brilliant phrase, sound analysis.
  2. Stupid Stupid xBox -- The hardcore/soft-tv transition and any lead they feel they have is simply not defensible by licensing other industries’ generic video or music content because those industries will gladly sell and license the same content to all other players. A single custom studio of 150 employees also can not generate enough content to defensibly satisfy 76M+ customers. Only with quality primary software content from thousands of independent developers can you defend the brand and the product. Only by making the user experience simple, quick, and seamless can you defend the brand and the product. Never seen a better put statement of why an ecosystem of indies is essential.
  3. Data Feedback Loops for TV (Salon) — Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.
  4. wrk -- a modern HTTP benchmarking tool capable of generating significant load when run on a single multi-core CPU. It combines a multithreaded design with scalable event notification systems such as epoll and kqueue.

February 12 2013

Four short links: 12 February 2013

  1. Your USB Sticks Are Made With Chopsticks (Bunnie Huang) — behind-the-scenes on how USB sticks are made.
  2. mutetab — find and kill the Chrome tab making all the damn noise! (via Nelson Minar)
  3. Visualization, Modeling, and Surprises (John D Cook) — paraphrases Hadley Wickham: Visualization can surprise you, but it doesn’t scale well. Modelling scales well, but it can’t surprise you.
  4. Head Like an Orange — science animated GIFs, assembled from nature documentaries. (via Ed Yong)

January 17 2013

Four short links: 17 January 2013

  1. Free Book Sifter — lists all the free books on Amazon, has RSS feeds and newsletters. (via BoingBoing)
  2. Whom the Gods Would Destroy, They First Give Realtime Analytics — a few key reasons why truly real-time analytics can open the door to a new type of (realtime!) bad decision making. [U]ser demographics could be different day over day. Or very likely, you could see a major difference in user behavior immediately upon releasing a change, only to watch it evaporate as users learn to use new functionality. Given all of these concerns, the conservative and reasonable stance is to only consider tests that last a few days or more.
  3. Web Book Boilerplate (Github) — uses plain old markdown and generates a well structured HTML version of your written words. Since it’s sitting on top of Pandoc and Grunt, you can easily make your books available for every platform. MIT-style license.
  4. Raspberry Pi Education Manual (PDF) — from Scratch to Python and HCI all via the Raspberry Pi. Intended to be informative and a series of lessons for teachers and students learning coding with the Raspberry Pi as their first device.

November 07 2012

Four short links: 8 November 2012

  1. Closely — new startup by Perry Evans (founder of MapQuest), giving businesses a simple app to track competitors’ online deals and social media activity. Seems a genius move to me: so many businesses flounder online, “I don’t know what to do!”, so giving them a birds-eye view of their competition turns the problem into “do better than them!”.
  2. The FT in Play (Reuters) — very interesting point in this analysis of the Financial Times being up for sale: [Traditional] journalism doesn’t have economies of scale. The bigger that journalistic organizations become, the less efficient they get. (via Bernard Hickey)
  3. Big Data Behind Obama’s Win (Time) — huge analytics operation, very secretive, providing insights and updates on everything.
  4. How to Predict the Future -- This is the story of a spreadsheet I’ve been keeping for almost twenty years. Thesis: hardware trends more useful for predicting advances than software trends. (via Kenton Kivestu)

October 30 2012

Four short links: 30 October 2012

  1. Fastly’s S3 Latency Monitor -- The graph represents real-time response latency for Amazon S3 as seen by Fastly’s Ashburn, VA edge server. I’ve been watching #sandy’s effect on the Internet in real-time, while listening to its effect on people in real-time. Amazing.
  2. Button Upgrade (Gizmodo) — elegant piece of button design, for sale on Shapeways.
  3. Inside a Dozen USB Chargers — amazing differences in such seemingly identical products. I love the comparison between genuine and counterfeit Apple chargers. (via Hacker News)
  4. Why Products Fail (Wired) — researcher scours the stock market filings of publicly-listed companies to extract information about warranties. Before, even information like the size of the market—how much gets paid out each year in warranty claims—was a mystery. Nobody, not analysts, not the government, not the companies themselves, knew what it was. Now Arnum can tell you. In 2011, for example, basic warranties cost US manufacturers $24.7 billion. Because of the slow economy, this is actually down, Arnum says; in 2007 it was around $28 billion. Extended warranties—warranties that customers purchase from a manufacturer or a retailer like Best Buy—account for an estimated $30.2 billion in additional claims payments. Before Arnum, this $60 billion-a-year industry was virtually invisible. Another hidden economy revealed. (via BoingBoing)

October 15 2012

Four short links: 15 October 2012

  1. Cheap Thermocam — cheap thermal imaging camera, takes about a minute to capture an image. (via IEEE Spectrum)
  2. Observations on What’s Getting Downvoted (Ars Technica) — fascinating piece of social work, showing how the community polices (or reacts to) trolls. (via Hacker News)
  3. Dark Social (The Atlantic) — Just look at that graph. On the one hand, you have all the social networks that you know. They’re about 43.5 percent of our social traffic. On the other, you have this previously unmeasured darknet that’s delivering 56.5 percent of people to individual stories. This is not a niche phenomenon! It’s more than 2.5x Facebook’s impact on the site.
  4. A Tethered World -- All students, across all 56 represented countries, are doing generally the same few things. Facebook and Twitter, above all else, are the predominant tools for all information use among the participants. The predominance of these few tools are creating a homogenizing influence around the world.

June 15 2012

Top Stories: June 11-15, 2012

Here's a look at the top stories published across O'Reilly sites this week.

A reduced but important future for desktop computing
Josh Marinacci says most people will rely on mobile devices to handle their computing needs, but a select and small group of power users will continue to use desktop machines.

Big ethics for big data
"Ethics of Big Data" authors Kord Davis and Doug Patterson explore ownership, anonymization, privacy, and ways to evaluate and establish ethical data practices within an organization.

Stories over spreadsheets
Imagine a future where clear language supplants spreadsheets. In a recent interview, Narrative Science CTO Kris Hammond explained how we might get there.


Data in use from public health to personal fitness
Releasing public data can't fix the health care system by itself, but it provides tools as well as a model for data sharing.


What is DevOps?
NoOps, DevOps — no matter what you call it, operations won't go away. Ops experts and development teams will jointly evolve to meet the challenges of delivering reliable software to customers.



March 12 2012

O'Reilly Radar Show 3/12/12: Best data interviews from Strata California 2012

Below you'll find the script and associated links from the March 12, 2012 episode of O'Reilly Radar. An archive of past shows is available through O'Reilly Media's YouTube channel and you can subscribe to episodes of O'Reilly Radar via iTunes.



In this special edition of the Radar Show we're bringing you three of our best interviews from the 2012 Strata Conference in California.

First up is Hadoop creator Doug Cutting discussing the similarities between Linux and the big data world. [Interview begins 16 seconds in.]

In our second interview from Strata California, Max Gadney from After the Flood explains the benefits of video data graphics. [Begins at 7:04.]

In our final Strata CA interview, Kaggle's Jeremy Howard looks at the difference between big data and analytics. [Begins at 13:46.]

Closing

Just a reminder that you can always catch episodes of O'Reilly Radar at youtube.com/oreillymedia and subscribe to episodes through iTunes.

All of the links and resources mentioned during this episode are posted at radar.oreilly.com/show.

That's all we have for this episode. Thanks for joining us and we'll see you again soon.


Four short links: 12 March 2012

  1. Web-Scale User Modeling for Targeting (Yahoo! Research, PDF) -- research paper that shows how online advertisers build profiles of us and what matters (e.g., ads we buy from are more important than those we simply click on). Our recent surfing patterns are more relevant than historical ones, which is another indication that value of data analytics increases the closer to real-time it happens. (via Greg Linden)
  2. Information Technology and Economic Change -- research showing that cities which adopted the printing press had no prior growth advantage, but subsequently grew far faster than similar cities without printing presses. [...] The second factor behind the localisation of spillovers is intriguing given contemporary questions about the impact of information technology. The printing press made it cheaper to transmit ideas over distance, but it also fostered important face-to-face interactions. The printer’s workshop brought scholars, merchants, craftsmen, and mechanics together for the first time in a commercial environment, eroding a pre-existing “town and gown” divide.
  3. They Just Don't Get It (Cameron Neylon) -- curating access to a digital collection does not scale.
  4. Should Libraries Get Out of the Ebook Business? -- provocative thought: the ebook industry is nascent, a small number of patrons have ereaders, the technical pain of DRM and incompatible formats makes for disproportionate support costs, and there are already plenty of worthy things libraries should be doing. I only wonder how quickly the dynamics change: a minority may have dedicated ereaders but a large number have smartphones and are reading on them already.

March 09 2012

Four short links: 9 March 2012

  1. Why The Symphony Needs A Progress Bar (Elaine Wherry) -- an excellent interaction designer tackles the real world.
  2. Biologic -- view your social network as though looking at cells through a microscope. Gorgeous and different.
  3. The Cost of Cracking -- analysis of used phone listings to see what improves and decreases price yields some really interesting results. Phones described as “decent” are typically priced 23% below the median. Who would describe something they’re selling as "decent" and price it below market value unless something fishy was going on? [...] On average, cracking your phone destroys 30-50% of its value instantly. Particularly interesting to me since Ms 10 just brought home her phone with *cough* a new starburst screensaver.
  4. OpenStreetMap Welcomes Apple -- this is the classy way to deal with the world's richest company quietly and badly using your work without acknowledgement.

March 08 2012

Strata Week: Profiling data journalists

Here are a few of the data stories that caught my attention this week.

Profiling data journalists

Over the past week, O'Reilly's Alex Howard has profiled a number of practicing data journalists, following up on the National Institute for Computer-Assisted Reporting's (NICAR) 2012 conference. Howard argues that data journalism has enormous importance, but "given the reality that those practicing data journalism remain a tiny percentage of the world's media, there's clearly still a need for its foremost practitioners to show why it matters, in terms of impact."

Howard's profiles include:


Surveying data marketplaces

Edd Dumbill takes a look at data marketplaces, the online platforms that host data from various publishers and offer it for sale to consumers. Dumbill compares four of the most mature data marketplaces — Infochimps, Factual, Windows Azure Data Marketplace, and DataMarket — and examines their different approaches and offerings.

Dumbill says marketplaces like these are useful in three ways:

"First, they provide a point of discoverability and comparison for data, along with indicators of quality and scope. Second, they handle the cleaning and formatting of the data, so it is ready for use (often 80% of the work in any data integration). Finally, marketplaces provide an economic model for broad access to data that would otherwise prove difficult to either publish or consume."

Analyzing sports stats

The Atlantic's Dashiell Bennett examines the MIT Sloan Sports Analytics Conference, a "festival of sports statistics" that has grown over the past six years from 175 attendees to more than 2,200.

Bennett writes:

"For a sports conference, the event is noticeably athlete-free. While a couple of token pros do occasionally appear as panel guests, this is about the people behind the scenes — those who are trying to figure out how to pick those athletes for their team, how to use them on the field, and how much to pay them without looking like a fool. General managers and team owners are the stars of this show ... The difference between them and the CEOs of most companies is that the sports guys have better data about their employees ... and a lot of their customers have it memorized."

Got data news?

Feel free to email me.


March 06 2012

Four short links: 6 March 2012

  1. SoupHub -- NZ project putting a computer with Internet access (and instruction and help) into a soup kitchen. I can't take any credit for it, but I'm delighted beyond measure that the idea for this was hatched at Kiwi Foo Camp. I love that my peeps are doing stuff that matters. (See also the newspaper writeup)
  2. Bandwidth of Pages -- view a 140 character tweet on the web and you load 2MB of, well, let's call it crap.
  3. On The Reductionism of Analytics in Education (Anne Zelenka) -- Learning analytics, as practiced today, is reductionist to an extreme. We are reducing too many dimensions into too few. More than that, we are describing and analyzing only those things that we can describe and analyze, when what matters exists at a totally different level and complexity. We are missing emergent properties of educational and learning processes by focusing on the few things we can measure and by trying to automate what decisions and actions might be automated. A fantastic post, which coins the phrase "the math is not the territory".
  4. Quotes Worth Spreading (Karl Fisch) -- collection of thought-provoking quotes from recent TED talks. Be generous by graciously accepting compliments. It's a gift you give the complimenter (John Bates) is something I'm particularly working on.

February 24 2012

Four short links: 24 February 2012

  1. Excel Cloud Data Analytics (Microsoft Research) -- clever--a cloud analytics backend with Excel as the frontend. Almost every business and finance person I've known has been way more comfortable with Excel than any other tool. (via Dr Data)
  2. HTTP Client -- Mac OS X app for inspecting and automating a lot of HTTP. cf the lovely Charles proxy for debugging. (via Nelson Minar)
  3. The Creative Destruction of Medicine -- using big data, gadgets, and sweet tech in general to personalize and improve healthcare. (via New York Times)
  4. EFF Wins Protection of Time Zone Database (EFF) -- I posted about the silliness before (maintainers of the only comprehensive database of time zones were being threatened by astrologers). The EFF stepped in, beat back the buffoons, and now we're back to being responsible when we screw up timezones for phone calls.
