
October 04 2013

Four short links: 4 October 2013

  1. Case and Molly, a Game Inspired by Neuromancer (Greg Borenstein) — On reading Neuromancer today, this dynamic feels all too familiar. We constantly navigate the tension between the physical and the digital in a state of continuous partial attention. We try to walk down the street while sending text messages or looking up GPS directions. We mix focused work with a stream of instant message and social media conversations. We dive into the sudden and remote intimacy of seeing a family member’s face appear on FaceTime or Google Hangout. “Case and Molly” uses the mechanics and aesthetics of Neuromancer’s account of cyberspace/meatspace coordination to explore this dynamic.
  2. Rethinking Ray Ozzie — an inescapable conclusion: Ray Ozzie was right. And Microsoft’s senior leadership did not listen, certainly not at the time, and perhaps not until it was too late. Hear, hear!
  3. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank (PDF) — apparently it nails sentiment analysis, and will be “open sourced”. At least, according to this GigaOm piece, which also explains how it works.
  4. PLoS ASAP Award Finalists Announced — with pointers to interviews with the finalists, doing open access good work like disambiguating species names and doing open source drug discovery.

October 02 2013

Four short links: 2 October 2013

  1. Instant Translator Glasses (ZDNet) — character recognition to do instant translating, and a UI that turns any flat surface into a touch-screen via a finger-ring sensor.
  2. draw.io — diagramming … In The Cloud!
  3. Airmail — Mac Gmail client with an offline mode that fails to suck.
  4. The Page-Fault Weird Machine: Lessons in Instruction-less Computation (Usenix) — video, audio, and text of a paper that’ll make your head hurt. We demonstrate a Turing-complete execution environment driven solely by the IA32 architecture’s interrupt handling and memory translation tables, in which the processor is trapped in a series of page faults and double faults, without ever successfully dispatching any instructions. LOLWUT?!

September 24 2013

Four short links: 30 September 2013

  1. Steve Yegge on GROK (YouTube) — The Grok Project is an internal Google initiative to simplify the navigation and querying of very large program source repositories. We have designed and implemented a language-neutral, canonical representation for source code and compiler metadata. Our data production pipeline runs compiler clusters over all Google’s code and third-party code, extracting syntactic and semantic information. The data is then indexed and served to a wide variety of clients with specialized needs. The entire ecosystem is evolving into an extensible platform that permits languages, tools, clients and build systems to interoperate in well-defined, standardized protocols.
  2. Deep Learning for Semantic Analysis — When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effect of contrastive conjunctions as well as negation and its scope at various tree levels for both positive and negative phrases.
  3. Fireshell — workflow tools and framework for front-end developers.
  4. SICP.js — lots of Structure and Interpretation of Computer Programs (the canonical text for higher-order programming) ported to Javascript.

September 16 2013

Four short links: 16 September 2013

  1. UAV Offers of Assistance in Colorado Rebuffed by FEMA — we were told by FEMA that anyone flying drones would be arrested. [...] Civil Air Patrol and private aircraft were authorized to fly over the small town tucked into the base of the Rockies. Unfortunately, due to the high terrain around Lyons and the large turn radius of manned aircraft, they were flying well out of a useful visual range and didn’t employ cameras or live video feed to support the recovery effort. Meanwhile, we were grounded on the Lyons high school football field with two Falcons that could have mapped the entire town in less than 30 minutes, with another few hours to process the data, providing a near real-time map of the entire town.
  2. Texas Bans Some Private Use of Drones (DIY Drones) — growing move for govt to regulate drones.
  3. IETF PRISM-Proof Plans (Parity News) — Baker starts off by listing the attack vectors, including the likes of information/content disclosure, metadata analysis, traffic analysis, denial-of-service attacks, and protocol exploits. The author then describes the different capabilities of an attacker and the ways in which an attack can be carried out: passive observation, active modification, cryptanalysis, covert channel analysis, lawful interception, and subversion or coercion of intermediaries, among others.
  4. Data Mining and Analysis: Fundamental Concepts and Algorithms (PDF) — 650 pages on clustering, sequence mining, SVMs, and more. (via the author’s page)

August 23 2013

Four short links: 23 August 2013

  1. Bradley Manning and the Two Americas (Quinn Norton) — The first America built the Internet, but the second America moved onto it. And they both think they own the place now. The best explanation you’ll find for wtf is going on.
  2. Staggering Cost of Inventing New Drugs (Forbes) — $5 billion to develop a new drug; and subject to an inverse-Moore’s law: A 2012 article in Nature Reviews Drug Discovery says the number of drugs invented per billion dollars of R&D invested has been cut in half every nine years for half a century.
  3. Who’s Watching You (Tim Bray) — threat modelling. Everyone should know this.
  4. Data Mining with Weka — learn data mining with the popular open source Weka platform.

August 20 2013

Four short links: 20 August 2013

  1. pineapple.io — attempt to crowdsource rankings for tutorials for important products, so you’re not picking your way through Google search results littered with tutorials written by incompetent illiterates for past versions of the software.
  2. BBC Forum — American social psychologist Aleks Krotoski has been looking at how the internet affects the way we talk to ourselves. Podcast (available for the next 30 days) from the BBC. (via Vaughan Bell)
  3. Why Can’t My Computer Understand Me? (New Yorker) — using anaphora as the basis of an intelligence test, as an example of what AI should be striving for. It’s not just that contemporary A.I. hasn’t solved these kinds of problems yet; it’s that contemporary A.I. has largely forgotten about them. In Levesque’s view, the field of artificial intelligence has fallen into a trap of “serial silver bulletism,” always looking to the next big thing, whether it’s expert systems or Big Data, but never painstakingly analyzing all of the subtle and deep knowledge that ordinary human beings possess. That’s a gargantuan task — “more like scaling a mountain than shoveling a driveway,” as Levesque writes. But it’s what the field needs to do.
  4. 507 Mechanical Movements — an old basic engineering textbook, animated. Me gusta.

June 17 2013

Four short links: 17 June 2013

  1. Weekend Reads on Deep Learning (Alex Dong) — an article and two videos unpacking “deep learning” such as multilayer neural networks.
  2. The Internet of Actual Things — “I have 10 reliable activations remaining,” your bulb will report via some ridiculous light-bulbs app on your phone. “Now just nine. Remember me when I’m gone.” (via Andy Baio)
  3. Announcing the Mozilla Science Lab (Kaitlin Thaney) — We also want to find ways of supporting and innovating with the research community – building bridges between projects, running experiments of our own, and building community. We have an initial idea of where to start, but want to start an open dialogue to figure out together how to best do that, and where we can be of most value.
  4. NAND to Tetris — The site contains all the software tools and project materials necessary to build a general-purpose computer system from the ground up. We also provide a set of lectures designed to support a typical course on the subject. (via Hacker News)

June 07 2013

Four short links: 7 June 2013

  1. Accumulo — NSA’s BigTable implementation, released as an Apache project.
  2. How the Robots Lost (Business Week) — the decline of high-frequency trading profits (basically, markets worked and imbalances in speed and knowledge have been corrected). Notable for the regulators getting access to the technology that the traders had: Last fall the SEC said it would pay Tradeworx, a high-frequency trading firm, $2.5 million to use its data collection system as the basic platform for a new surveillance operation. Code-named Midas (Market Information Data Analytics System), it scours the market for data from all 13 public exchanges. Midas went live in February. The SEC can now detect anomalous situations in the market, such as a trader spamming an exchange with thousands of fake orders, before they show up on blogs like Nanex and ZeroHedge. If Midas sees something odd, Berman’s team can look at trading data on a deeper level, millisecond by millisecond.
  3. PRISM: Surprised? (Danny O’Brien) — I really don’t agree with the people who think “We don’t have the collective will”, as though there’s some magical way things got done in the past when everyone was in accord and surprised all the time. It’s always hard work to change the world. Endless, dull hard work. Ten years later, when you’ve freed the slaves or beat the Nazis everyone is like “WHY CAN’T IT BE AS EASY TO CHANGE THIS AS THAT WAS, BACK IN THE GOOD OLD DAYS. I GUESS WE’RE ALL JUST SHEEPLE THESE DAYS.”
  4. What We Don’t Know About Spying on Citizens is Scarier Than What We Do Know (Bruce Schneier) — The U.S. government is on a secrecy binge. It overclassifies more information than ever. And we learn, again and again, that our government regularly classifies things not because they need to be secret, but because their release would be embarrassing. Open source BigTable implementation: free. Data gathering operation around it: $20M/year. Irony in having the extent of authoritarian Big Brother government secrecy questioned just as a whistleblower’s military trial is held “off the record”: priceless.

June 04 2013

Four short links: 4 June 2013

  1. WeevilScout — browser app that turns your browser into a worker for distributed computation tasks. See the poster (PDF). (via Ben Lorica)
  2. sregex (Github) — A non-backtracking regex engine library for large data streams. See also slide notes from a YAPC::NA talk. (via Ivan Ristic)
  3. Bobby Tables — a guide to preventing SQL injections; a minimal parameterized-query sketch follows this list. (via Andy Lester)
  4. Deep Learning Using Support Vector Machines (Arxiv) — we are proposing to train all layers of the deep networks by backpropagating gradients through the top level SVM, learning features of all layers. Our experiments show that simply replacing softmax with linear SVMs gives significant gains on datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop’s face expression recognition challenge. A rough sketch of the idea also follows this list. (via Oliver Grisel)
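On the Bobby Tables link (item 3 above): the guide itself covers many languages and drivers; as a minimal illustrative sketch, here is Python’s built-in sqlite3 module passing user input as a bound parameter instead of splicing it into the SQL string. This is a generic example, not code from the Bobby Tables site.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")
conn.execute("INSERT INTO students VALUES ('Alice')")

user_input = "Robert'); DROP TABLE students;--"   # little Bobby Tables

# Unsafe: building SQL by string interpolation lets input rewrite the query.
# conn.executescript("SELECT * FROM students WHERE name = ('%s')" % user_input)

# Safe: a placeholder binds the value as data, never as SQL.
rows = conn.execute(
    "SELECT * FROM students WHERE name = ?", (user_input,)
).fetchall()
print(rows)   # [] -- the malicious string matches nothing and executes nothing
```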
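And for item 4: the paper’s own implementation isn’t reproduced here, but a rough PyTorch sketch of the core idea (swap the usual softmax cross-entropy objective for a multi-class squared hinge loss on the top linear layer, and let gradients backpropagate through every layer as usual) might look like this. The network size, data, and hyperparameters are stand-ins, not the paper’s setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 784)                 # toy stand-in for MNIST-like inputs
y = torch.randint(0, 10, (256,))          # 10 classes

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),                   # top layer emits raw scores, no softmax
)

criterion = nn.MultiMarginLoss(p=2, margin=1.0)   # squared hinge, L2-SVM style
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(50):
    optimizer.zero_grad()
    loss = criterion(model(X), y)         # hinge loss on the top layer...
    loss.backward()                       # ...backpropagated through every layer
    optimizer.step()

accuracy = (model(X).argmax(dim=1) == y).float().mean()
print("training accuracy:", accuracy.item())
```

The only change from a vanilla classifier is the loss function, which matches the paper’s claim that simply replacing softmax with a linear SVM objective is where the gains come from.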

May 16 2013

Four short links: 16 May 2013

  1. Australian Filter Scope Creep — The Federal Government has confirmed its financial regulator has started requiring Australian Internet service providers to block websites suspected of providing fraudulent financial opportunities, in a move which appears to also open the door for other government agencies to unilaterally block sites they deem questionable in their own portfolios.
  2. Embedding Actions in Gmail — after years of benign neglect, it’s good to see Gmail worked on again. We’ve said for years that email’s a fertile ground for doing stuff better, and Google seem to have the religion. (see Send Money with Gmail for more).
  3. What Keeps Me Up at Night (Matt Webb) — Matt’s building a business around connected devices. Here he explains why the category could be owned by any of the big players. In times like this I remember Howard Aiken’s advice: Don’t worry about people stealing your ideas. If it is original you will have to ram it down their throats.
  4. Image Texture Predicts Avian Density and Species Richness (PLOSone) — Surprisingly and interestingly, remotely sensed vegetation structure measures (i.e., image texture) were often better predictors of avian density and species richness than field-measured vegetation structure, and thus show promise as a valuable tool for mapping habitat quality and characterizing biodiversity across broad areas.

May 15 2013

Four short links: 15 May 2013

  1. Facial Recognition in Google Glass (Mashable) — this makes Glass umpty more attractive to me. It was created in a hackathon for doctors to use with patients, but I need it wired into my eyeballs.
  2. How to Price Your Hardware Project — At the end of the day you are picking a price that enables you to stay in business. As @meganauman says, “Profit is not something to add at the end, it is something to plan for in the beginning.”
  3. Hardware Pricing (Matt Webb) — When products connect to the cloud, the cost structure changes once again. On the one hand, there are ongoing network costs which have to be paid by someone. You can do that with a cut of transactions on the platform, by absorbing the network cost upfront in the RRP, or with user-pays subscription.
  4. Dicoogle — open source medical image search. Written up in a PLOSone paper.

April 22 2013

A different take on data skepticism

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

Well, arguing against greater understanding of the methods we apply is like arguing against motherhood and apple pie, and Cathy and Mike are spot on in their diagnoses of the current situation. And yet …

There is so much value to be gained if we can put the power of learning, inference, and prediction methods into the hands of more developers and domain experts. But how can we avoid the pitfalls that Cathy and Mike are rightly concerned about? If a seemingly simple method like k-nearest neighbors classification is dangerous in unskilled hands (and it certainly is), then what hope is there? Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?
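To make the k-means point concrete, here is a tiny scikit-learn sketch with synthetic, made-up data: asked for four clusters in a single undifferentiated cloud, k-means dutifully returns four clusters and gives no hint that the split is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
blob = rng.normal(size=(500, 2))          # one undifferentiated cloud, no real clusters

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(blob)

# Exactly four clusters come back, because four were requested; nothing in the
# output reports whether four (or any) clusters actually exist in the data.
print(np.bincount(km.labels_))            # four roughly equal, essentially arbitrary groups
print(km.inertia_)                        # within-cluster SSE falls as k grows regardless
```

Nothing in that output distinguishes structureless data from data with four real clusters; the more fundamental question simply never gets asked.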

So, which methods are better in this regard? In general, it’s those that explore model space in addition to model parameters. In the case of k-means, for example, this would mean learning the number k in addition to the cluster assignment for each data point. For k-nearest neighbors, we could learn the number of exemplars to use and also the distance metric that provides the best explanation for the data. This multi-level approach might sound advanced, and it is true that these implementations are more complex. But complexity of implementation needn’t correlate with “danger” (thanks in part to software engineering), and it’s certainly not a sufficient reason to dismiss more robust methods.
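As a minimal sketch of what exploring model space can look like in practice (scikit-learn again, with a Gaussian mixture standing in for k-means and BIC standing in for a fuller model search; the dataset and the candidate range of k are invented for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=7, cluster_std=1.0, random_state=0)

# Fit a family of models with k = 1..10 and let an information criterion
# arbitrate, instead of baking in a single assumed k.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 11)}
best_k = min(bic, key=bic.get)            # lowest BIC trades fit against complexity
print("BIC-selected number of clusters:", best_k)
```

BIC is only one crude way to trade fit against complexity, and Bayesian nonparametric methods go much further by treating k itself as uncertain, but even this small step asks the question that plain k-means skips.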

I find the database analogy useful here: developers with only a foggy notion of database implementation routinely benefit from the expertise of the programmers who do understand these systems — i.e., the “professionals.” How? Well, decades of experience — and lots of trial and error — have yielded good abstractions in this area. As a result, we can meaningfully talk about the database “layer” in our overall “stack.” Of course, these abstractions are leaky, like all others, and there are plenty of sharp edges remaining (and, some might argue, more being created every day with the explosion of NoSQL solutions). Nevertheless, my weekend-project webapp can store and query insane amounts of data — and I have no idea how to implement a B-tree.

For ML to have a similarly broad impact, I think the tools need to follow a similar path. We need to push ourselves away from the viewpoint that sees ML methods as a bag of tricks, with the right method chosen on a per-problem basis, success requiring a good deal of art, and evaluation mainly by artificial measures of accuracy at the expense of other considerations. Trustworthiness, robustness, and conservatism are just as important, and will have far more influence on the long-run impact of ML.

Will well-intentioned people still be able to lie to themselves? Sure, of course! Let alone the greedy or malicious actors that Cathy and Mike are also concerned about. But our tools should make the common cases easy and safe, and that’s not the reality today.


April 11 2013

Data skepticism

A couple of months ago, I wrote that “big data” is heading toward the trough of a hype curve as a result of oversized hype and promises. That’s certainly true. I see more expressions of skepticism about the value of data every day. Some of the skepticism is a reaction against the hype; a lot of it arises from ignorance, and it has the same smell as the rich history of science denial from the tobacco industry (and probably much earlier) onward.

But there’s another thread of data skepticism that’s profoundly important. On her MathBabe blog, Cathy O’Neil has written several articles about lying with data — about intentionally developing models that don’t work because it’s possible to make more money from a bad model than a good one. (If you remember Mel Brooks’ classic “The Producers,” it’s the same idea.) In a slightly different vein, Cathy argues that making machine learning simple for non-experts might not be in our best interests; it’s easy to start believing answers because the computer told you so, without understanding why those answers might not correspond with reality.

I had a similar conversation with David Reiley, an economist at Google, who is working on experimental design in social sciences. Heavily paraphrasing our conversation, he said that it was all too easy to think you have plenty of data, when in fact you have the wrong data, data that’s filled with biases that lead to misleading conclusions. As Reiley points out (pdf), “the population of people who sees a particular ad may be very different from the population who does not see an ad”; yet, many data-driven studies of advertising effectiveness don’t take this bias into account. The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.
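A toy simulation, with entirely made-up numbers, shows why: if the people most likely to see an ad are also the people most likely to buy anyway, a naive saw-the-ad versus didn’t-see-it comparison “finds” a lift even when the ad does nothing, while a randomized comparison on the same population does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
interest = rng.normal(size=n)             # latent interest in the product

# Observational world: interested people browse more, so they see the ad more often...
saw_ad = rng.random(n) < 1 / (1 + np.exp(-2 * interest))
# ...and they also buy more, even though the ad itself has zero true effect here.
buys = rng.random(n) < 1 / (1 + np.exp(-(interest - 2)))

naive = buys[saw_ad].mean() - buys[~saw_ad].mean()
print("naive lift from 'seeing the ad':", round(naive, 3))    # clearly positive, yet spurious

# Randomized experiment on the same population: exposure no longer tracks interest.
treated = rng.random(n) < 0.5
rct = buys[treated].mean() - buys[~treated].mean()
print("randomized estimate:", round(rct, 3))                  # hovers around zero
```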

Skepticism about data is normal, and it’s a good thing. If I had to give a one-line definition of science, it might be something like “organized and methodical skepticism based on evidence.” So, if we really want to do data science, it has to be done by incorporating skepticism. And here’s the key: data scientists have to own that skepticism. Data scientists have to be the biggest skeptics. Data scientists have to be skeptical about models, they have to be skeptical about overfitting, and they have to be skeptical about whether we’re asking the right questions. They have to be skeptical about how data is collected, whether that data is unbiased, and whether that data — even if there’s an inconceivably large amount of it — is sufficient to give you a meaningful result.

Because the bottom line is: if we’re not skeptical about how we use and analyze data, who will be? That’s not a pretty thought.

April 09 2013

Four short links: 9 April 2013

  1. Automated Essay Grading To Come to EdX (NY Times) — shortly after we get software that writes stories for us, we get software to read them for us.
  2. AMD Calls End of Moore’s Law in Ten Years (ComputerWorld) — story based on this video, where Michio Kaku lays out the timeline for Moore’s Law’s wind-down and the spin-up of new technology.
  3. Addressing Human Trafficking Through Technology (danah boyd) — technologists love to make tech and then assert it’ll help people. Danah’s work on teens and now trafficking steers us to do what works, rather than what is showy or easiest.
  4. Product Management (Rowan Simpson) — hand this to anyone who asks what product management actually is. Excellent explanation.

April 01 2013

Four short links: 1 April 2013

  1. MLDemos — an open-source visualization tool for machine learning algorithms created to help studying and understanding how several algorithms function and how their parameters affect and modify the results in problems of classification, regression, clustering, dimensionality reduction, dynamical systems and reward maximization. (via Mark Alen)
  2. kiln (GitHub) — open source extensible on-device debugging framework for iOS apps.
  3. Industrial Internet — the O’Reilly report on the industrial Internet of things is out. Prasad suggests an illustration: for every car with a rain sensor today, there are more than 10 that don’t have one. Instead of an optical sensor that turns on windshield wipers when it sees water, imagine the human in the car as a sensor — probably somewhat more discerning than the optical sensor in knowing what wiper setting is appropriate. A car could broadcast its wiper setting, along with its location, to the cloud. “Now you’ve got what you might call a rain API — two machines talking, mediated by a human being,” says Prasad. It could alert other cars to the presence of rain, perhaps switching on headlights automatically or changing the assumptions that nearby cars make about road traction.
  4. Unique in the Crowd: The Privacy Bounds of Human Mobility (PDF, Nature) — We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier’s antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. We coarsen the data spatially and temporally to find a formula for the uniqueness of human mobility traces given their resolution and the available outside information. This formula shows that the uniqueness of mobility traces decays approximately as the 1/10 power of their resolution. Hence, even coarse datasets provide little anonymity. These findings represent fundamental constraints to an individual’s privacy and have important implications for the design of frameworks and institutions dedicated to protect the privacy of individuals. As Edd observed, “You are a unique snowflake, after all.” (via Alasdair Allan)
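As a back-of-the-envelope illustration of the counting exercise behind that claim (a toy model with made-up parameters, not the paper’s carrier data; uniformly random traces are far less structured than real ones, so this toy version will overstate uniqueness): draw four spatio-temporal points from one trace and count how many traces in the dataset are consistent with them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_hours, n_antennas, n_points = 5_000, 200, 100, 4

# Toy traces: which antenna each person is seen at, hour by hour.
traces = rng.integers(0, n_antennas, size=(n_people, n_hours))

unique = 0
targets = 500                              # sample of people to test, for speed
for person in range(targets):
    hours = rng.choice(n_hours, size=n_points, replace=False)
    pattern = traces[person, hours]
    matches = np.all(traces[:, hours] == pattern, axis=1)
    unique += matches.sum() == 1           # only the target is consistent with these points
print(f"uniquely identified by {n_points} points: {unique / targets:.0%}")
```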

March 25 2013

Four short links: 25 March 2013

  1. Analytics for Learning — Since doing good learning analytics is hard, we often do easy learning analytics and pretend that they are good instead. But pretending doesn’t make it so. (via Dan Meyer)
  2. Reproducible Research — a list of links to related work about reproducible research, reproducible research papers, etc. (via Stijn Debrouwere)
  3. Pentagon Deploying 100+ Cyber TeamsThe organization defending military networks — cyber protection forces — will comprise more than 60 teams, a Pentagon official said. The other two organizations — combat mission forces and national mission forces — will conduct offensive operations. I’ll repeat that: offensive operations.
  4. Towards Deterministic Compressed Sensing (PDF) — instead of taking lots of data, compressing by throwing some away, can we only take a few samples and reconstruct the original from that? (more mathematically sound than my handwaving explanation). See also Compressed sensing and big data from the Practical Quant. (via Ben Lorica)
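For the compressed sensing item, a hedged sketch of the standard recipe it hand-waves at: random Gaussian measurements plus a sparse-recovery solver (here scikit-learn’s orthogonal matching pursuit), rather than the deterministic constructions the paper itself studies. The sizes and sparsity level are made up.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, m, k = 400, 80, 10                      # signal length, measurements, sparsity

x = np.zeros(n)                            # ground-truth signal, only k nonzero entries
x[rng.choice(n, k, replace=False)] = rng.normal(size=k)

A = rng.normal(size=(m, n)) / np.sqrt(m)   # far fewer measurements than samples
y = A @ x                                  # the "few samples" we actually take

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(A, y)
x_hat = omp.coef_
err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print("relative reconstruction error:", err)   # typically tiny at these sizes
```

With many fewer measurements than signal samples, a sufficiently sparse original is typically recovered almost exactly, which is the “take a few samples and reconstruct” trick in miniature.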

March 11 2013

Four short links: 11 March 2013

  1. Adventures in the Ransom Trade — between insurance, protection, and ransoms, Sean Gourley describes it as “one of the more interesting grey markets.” (via Sean Gourley)
  2. About High School Computer Science Teachers (Selena Deckelmann) — Selena gets an education in the state of high school computer science education.
  3. Learning From Big Data (Google Research) — the Wikilinks Corpus: 40 million total disambiguated mentions within over 10 million web pages [...] The mentions are found by looking for links to Wikipedia pages where the anchor text of the link closely matches the title of the target Wikipedia page. If we think of each page on Wikipedia as an entity (an idea we’ve discussed before), then the anchor text can be thought of as a mention of the corresponding entity.
  4. Teens Have Always Gone Where Identity Isn’tif you look back at one of the first dominant social platforms, AOL Instant Messenger, it looks a lot like the pseudonymous Tumblr and Snapchat of today in many respects. You used an avatar that was not your face. Your screenname was not indexed and not personally identifiable (mine was Goober1310).

March 08 2013

Four short links: 8 March 2013

  1. mlcomp — a free website for objectively comparing machine learning programs across various datasets for multiple problem domains.
  2. Printing Code: Programming and the Visual Arts (Vimeo) — Rune Madsen’s talk from Heroku’s Waza. (via Andrew Odewahn)
  3. What Data Brokers Know About You (ProPublica) — excellent run-down on the compilers of big data about us. Where are they getting all this info? The stores where you shop sell it to them.
  4. Subjective Impressions Do Not Mirror Online Reading Effort: Concurrent EEG-Eyetracking Evidence from the Reading of Books and Digital Media (PLOSone) — Comprehension accuracy did not differ across the three media for either group, and EEG and eye-fixation measures were the same. Yet readers stated they preferred paper. That preference, the authors conclude, isn’t because the screen is less readable. From this perspective, the subjective ratings of our participants (and those in previous studies) may be viewed as attitudes within a period of cultural change.