
January 08 2014

How did we end up with a centralized Internet for the NSA to mine?

I’m sure it was a Wired editor, and not the author Steven Levy, who assigned the title “How the NSA Almost Killed the Internet” to yesterday’s fine article about the pressures on large social networking sites. Whoever chose the title, it’s justifiably grandiose because to many people, yes, companies such as Facebook and Google constitute what they know as the Internet. (The article also discusses threats to divide the Internet infrastructure into national segments, which I’ll touch on later.)

So my question today is: How did we get such industry concentration? Why is a network famously based on distributed processing, routing, and peer connections characterized now by a few choke points that the NSA can skim at its leisure?

I commented as far back as 2006 that industry concentration makes surveillance easier. I pointed out then that the NSA could elicit a level of cooperation (and secrecy) from the likes of Verizon and AT&T that it would never get in the US of the 1990s, where Internet service was provided by thousands of mom-and-pop operations like Brett Glass’s wireless service in Laramie, Wyoming. Things are even more concentrated now, in services if not infrastructure.

Having lived through the Boston Marathon bombing, I understand what the NSA claims to be fighting, and I am willing to seek some compromise between their needs for spooking and the protections of the Fourth Amendment to the US Constitution. But as many people have pointed out, the dangers of centralized data storage go beyond the NSA. Bruce Schneier just published a pretty comprehensive look at how weak privacy leads to a weakened society. Others jeer that if social networking companies weren’t forced to give governments data, they’d be doing just as much snooping on their own to raise the click rates on advertising. And perhaps our most precious, closely held data — personal health information — is constantly subject to a marketplace for data mining.

Let’s look at the elements that make up the various layers of hardware and software we refer to casually as the Internet. How do centralization and decentralization play out at each layer?

Public routers

One of Snowden’s major leaks reveals that the NSA pulled a trick comparable to the Great Firewall of China, tracking traffic as it passes through major routers across national borders. Like many countries that censor traffic, in other words, the NSA capitalized on the centralization of international traffic.

Internet routing within the US has gotten more concentrated over the years. There were always different “tiers” of providers, who all did basically the same thing but at inequitable prices. Small providers always complained about the fees extracted by Tier 1 networks. A Tier 1 network can transmit its own traffic nearly anywhere it needs to go for just the cost of equipment, electricity, etc., while extracting profit from smaller networks that need its transport. So concentration in the routing industry is a classic economy of scale.

International routers, of the type targeted by the NSA and many other governments, are even more concentrated. African and Latin American ISPs historically complained about having to go through US or European routers even if the traffic just came back to their same continent. (See, for instance, section IV of this research paper.) This raised the costs of Internet use in developing countries.

The reliance of developing countries on outside routers stems from another simple economic truth: there are more routers in affluent countries for the same reason there are more shopping malls or hospitals in affluent countries. Foreigners who have broken US laws can be caught if they dare to visit a shopping mall or hospital in the US. By the same token, their traffic can be grabbed by the NSA as it travels to a router in the US, or one of the other countries where the NSA has established a foothold. It doesn’t help that the most common method of choosing routes, the Border Gateway Protocol (BGP), is a very old Internet standard with no concept of built-in security.

The solution is economic: more international routers to offload traffic from the MAE-Wests and MAE-Easts of the world. While opposing suggestions to “balkanize” the Internet, we can applaud efforts to increase connectivity through more routers and peering.

IaaS cloud computing

Centralization has taken place at another level of the Internet: storage and computing. Data is theoretically safe from intruders in the cloud so long as encryption is used both in storage and during transmission — but of course, the NSA thought of that problem long ago, just as they thought of everything. So use encryption, but don’t depend on it.
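As a concrete illustration of that posture, here is a minimal sketch of encrypting data on the client before it ever reaches a cloud provider, using the Python cryptography package's Fernet recipe. The file contents and the commented-out upload call are placeholders, not a prescription for any particular provider's SDK.

    # Minimal sketch: encrypt locally, store only ciphertext in the cloud.
    # Assumes the third-party "cryptography" package (pip install cryptography).
    from cryptography.fernet import Fernet

    # Generate and keep the key yourself; if it only ever lives on your own
    # machines, the cloud provider (or anyone skimming its storage) sees
    # nothing but ciphertext.
    key = Fernet.generate_key()
    fernet = Fernet(key)

    plaintext = b"personal notes that never leave the laptop unencrypted"
    ciphertext = fernet.encrypt(plaintext)

    # Upload `ciphertext` to whatever object store you use (hypothetical call,
    # swap in your provider's SDK):
    # bucket.put_object(Key="notes.bin", Body=ciphertext)

    # Later, after downloading the blob back:
    recovered = fernet.decrypt(ciphertext)
    assert recovered == plaintext

The point of the sketch is the division of labour: the provider stores and transmits bytes, but the key, and therefore the meaning, stays with you. Key management and endpoint compromise remain your problem, which is exactly why the advice is to use encryption without depending on it.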

Movement to the cloud is irreversible, so the question to ask is how free and decentralized the cloud can be. Private networks can be built on virtualization solutions such as the proprietary VMware and Azure or the open source OpenStack and Eucalyptus. The more providers there are, the harder it will be to do massive data collection.

SaaS cloud computing

The biggest change — what I might even term the biggest distortion — in the Internet over the past couple decades has been the centralization of content. Ironically, more and more content is being produced by individuals and small Internet users, but it is stored on commercial services, where it forms a tempting target for corporate advertisers and malicious intruders alike. Some people have seriously suggested that we treat the major Internet providers as public utilities (which would make them pretty big white elephants to unload when the next big thing comes along).

This was not technologically inevitable. Attempts at peer-to-peer social networking go back to the late 1990s with Jabber (now the widely used XMPP standard), which promised a distributed version of the leading Internet communications medium of the time: instant messaging. Diaspora more recently revived the idea in the context of Facebook-style social networking.

These services allow many independent people to maintain servers, offering the service in question to clients while connecting to one another where necessary. Such an architecture could improve overall reliability, because the failure of an individual server would be noticed only by people trying to communicate with it. The architecture would be pretty snoop-proof, too.
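For a flavour of what that federated model looks like in code, here is a minimal sketch of sending one message across servers over XMPP with the Python slixmpp library, following the shape of the library's published client examples. The accounts and domains are invented; each user would register on whichever independent server they trust.

    # Minimal sketch of federated messaging over XMPP, using slixmpp
    # (pip install slixmpp). The JIDs and servers are placeholders: alice's
    # account lives on one independently run server, bob's on another, and
    # the two servers talk to each other only to relay this one message.
    import slixmpp


    class OneShotSender(slixmpp.ClientXMPP):
        def __init__(self, jid, password, recipient, body):
            super().__init__(jid, password)
            self.recipient = recipient
            self.body = body
            self.add_event_handler("session_start", self.start)

        async def start(self, event):
            self.send_presence()
            await self.get_roster()
            # The message is routed server-to-server; no central service
            # ever has to see it.
            self.send_message(mto=self.recipient, mbody=self.body, mtype="chat")
            self.disconnect()


    if __name__ == "__main__":
        client = OneShotSender(
            "alice@server-one.example", "alice-password",
            "bob@server-two.example", "Hello from a different server",
        )
        client.connect()
        client.process(forever=False)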

Why hasn’t the decentralized model taken off? I blame SaaS. The epoch of concentration in social media coincides with the shift of attention from free software to SaaS as a way of delivering software. SaaS makes it easier to form a business around software (while the companies can still contribute to free software). So developers have moved to SaaS-based businesses and built new DevOps development and deployment practices around that model.

To be sure, in the age of the web browser, accessing a SaaS service is easier than fussing with free software. To champion distributed architectures such as Jabber and Diaspora, free software developers will have to invest as much effort into the deployment of individual servers as SaaS developers have invested in their models. Business models don’t seem to support that investment. Perhaps a concern for privacy will.

October 22 2013

Mining the social web, again

When we first published Mining the Social Web, I thought it was one of the most important books I worked on that year. Now that we’re publishing a second edition (which I didn’t work on), I find that I agree with myself. With this new edition, Mining the Social Web is more important than ever.

While we’re seeing more and more cynicism about the value of data, and particularly “big data,” that cynicism isn’t shared by most people who actually work with data. Data has undoubtedly been overhyped and oversold, but the best way to arm yourself against the hype machine is to start working with data yourself, to find out what you can and can’t learn. And there’s no shortage of data around. Everything we do leaves a cloud of data behind it: Twitter, Facebook, Google+ — to say nothing of the thousands of other social sites out there, such as Pinterest, Yelp, Foursquare, you name it. Google is doing a great job of mining your data for value. Why shouldn’t you?

There are few better ways to learn about mining social data than by starting with Twitter; Twitter is really a ready-made laboratory for the new data scientist. And this book is without a doubt the best and most thorough approach to mining Twitter data out there. But that’s only a starting point. We hear a lot in the press about sentiment analysis and mining unstructured text data; this book shows you how to do it. If you need to mine the data in web pages or email archives, this book shows you how. And if you want to understand how people collaborate on projects, Mining the Social Web is the only place I’ve seen that analyzes GitHub data.
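To give a feel for the kind of first experiment this leads to, here is a small sketch of the most basic text-mining step, counting which terms dominate a batch of tweets. The sample tweets are invented; the book's own examples go much further and work against the real Twitter API.

    # A first, deliberately simple text-mining step: which terms dominate a
    # batch of tweets? The tweet texts here are invented stand-ins for what
    # you would fetch from the Twitter API.
    from collections import Counter
    import re

    tweets = [
        "Loving the new data science meetup in town #datascience",
        "Big data is overhyped, small data is underrated",
        "Mining social data with Python is easier than it looks",
    ]

    STOPWORDS = {"the", "is", "in", "a", "an", "with", "than", "it", "new"}

    def terms(text):
        # Lowercase, keep word characters and hashtags, drop stopwords.
        return [t for t in re.findall(r"#?\w+", text.lower()) if t not in STOPWORDS]

    counts = Counter(t for tweet in tweets for t in terms(tweet))
    print(counts.most_common(5))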

All of the examples in the book are available on GitHub. In addition to the example code, which is bundled into IPython notebooks, Matthew has provided a VirtualBox VM that installs Python, all the libraries you need to run the examples, the examples themselves, and an IPython server. Checking out the examples is as simple as installing VirtualBox, installing Vagrant, cloning the 2nd edition’s GitHub archive, and typing “vagrant up.” (This quick start guide summarizes all of that.) You can execute the examples for yourself in the virtual machine; modify them; and use the virtual machine for your own projects, since it’s a fully functional Linux system with Python, Java, MongoDB, and other necessities pre-installed. You can view this as a book with accompanying examples in a particularly nice package, or you can view the book as “premium support” for an open source project that consists of the examples and the VM.

If you want to engage with the data that’s surrounding you, Mining the Social Web is the best place to start. Use it to learn, to experiment, and to build your own data projects.

September 03 2013

Demographics are dead: the new, technical face of marketing

Over the past five years, marketing has transformed from a primarily creative process into an increasingly data-driven discipline with strong technological underpinnings.

The central purpose of marketing hasn’t changed: brands still aim to tell a story, to emotionally connect with a prospective customer, with the goal of selling a product or service. But while the need to tell an interesting, authentic story has remained constant, customers and channels have fundamentally changed. Old Marketing took a spray-and-pray approach aimed at a broad, passive audience: agencies created demographic or psychographic profiles for theoretical consumers and broadcast ads on mass-consumption channels, such as television, print, and radio. “Targeting” was primarily about identifying high concentrations of a given consumer type in a geographic area.

The era of demographics is over. Advances in data mining have enabled marketers to develop highly specific profiles of customers at the individual level, using data drawn from actual personal behavior and consumption patterns. Now when a brand tells a story, it has the ability to tailor the narrative in such a way that each potential customer finds it relevant, personally. Users have become accustomed to this kind of sophisticated targeting; broad-spectrum advertising on the Internet is now essentially spam. At the same time, there is still a fine line between “well-targeted” and “creepy.”
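As a toy illustration of what a profile "at the individual level" means mechanically, here is a sketch that rolls raw behavioural events up into one row per customer with pandas. The event log and its column names are invented for the example; real systems do this at vastly larger scale and with far richer signals.

    # Toy sketch: turn a raw behavioural event log into per-customer profiles.
    # The data and column names are invented for illustration.
    import pandas as pd

    events = pd.DataFrame({
        "user_id":  [1, 1, 2, 2, 2, 3],
        "category": ["running", "running", "cooking", "cooking", "travel", "travel"],
        "spend":    [120.0, 80.0, 15.0, 30.0, 400.0, 250.0],
    })

    profiles = events.groupby("user_id").agg(
        total_spend=("spend", "sum"),
        visits=("spend", "size"),
        top_category=("category", lambda s: s.mode().iat[0]),
    )
    print(profiles)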

As data availability has improved the quality of potential opportunities, perpetual connectivity has increased the quantity. The proliferation of mobile devices, decreasing cost of data access, and popularity of social platforms have increased the number of avenues that brands have at their disposal. However, these factors have also made it harder for them to actually connect; today’s audience has an ever-shorter attention span, as focus is split across an increasing number of channels. Consumers are better informed than ever before; a price- or fact-check is simply a Google search away, whether users are in front of their TV screens or walking around stores.

This changing ecosystem has already resulted in a shift from transaction-oriented “product-centric marketing” to a relationship-focused “human-centric” approach. Consumer loyalty to brands is a thing of the past; brands now demonstrate loyalty to their customers. Marketing campaigns must be time-, place-, and context-aware. It’s also important to be technologically on point: cross-device, multi-channel, constantly measuring results and refining campaigns. Today’s technology has largely mitigated the Wanamaker problem, but a Goldilocks problem remains: consumers don’t want to be inundated with marketing that is too obnoxiously broad or too creepily specific. Consumers want relevance.

The Radar team is interested in the progression from “reaching” passive audiences to “engaging” active ones, and in the technology that enables brands to break through the noise and find the people who are most likely to love them. In upcoming months, we’ll be investigating the continuing evolution of marketing, with a focus on fusing the art of storytelling with the promise of data and the objectivity of analytics. If you’re excited about this field, or know of someone doing interesting work in the area, let us know — drop a note in the comments section, or ping me via email or Twitter.

August 15 2013

How A 'Deviant' Philosopher Built Palantir, A CIA-Funded Data-Mining Juggernaut - Forbes

http://www.forbes.com/sites/andygreenberg/2013/08/14/agent-of-intelligence-how-a-deviant-philosopher-built-palantir-a-cia-funded-

The biggest problem for Palantir’s business may be just how well its software works: It helps its customers see too much. In the wake of NSA leaker Edward Snowden’s revelations of the agency’s mass surveillance, Palantir’s tools have come to represent privacy advocates’ greatest fears of data-mining technology — Google-level engineering applied directly to government spying. That combination of Big Brother and Big Data has come into focus just as Palantir is emerging as one of the fastest-growing startups in the Valley, threatening to contaminate its first public impressions and render the firm toxic in the eyes of customers and investors just when it needs them most.

“They’re in a scary business,” says Electronic Frontier Foundation attorney Lee Tien. ACLU analyst Jay Stanley has written that Palantir’s software could enable a “true totalitarian nightmare, monitoring the activities of innocent Americans on a mass scale.”

Karp, a social theory Ph.D., doesn’t dodge those concerns. He sees Palantir as the company that can rewrite the rules of the zero-sum game of privacy and security. “I didn’t sign up for the government to know when I smoke a joint or have an affair,” he acknowledges. In a company address he stated, “We have to find places that we protect away from government so that we can all be the unique and interesting and, in my case, somewhat deviant people we’d like to be.”

Palantir has explored work in Saudi Arabia despite the staff’s misgivings about human rights abuses in the kingdom. And for all Karp’s emphasis on values, his apology for the WikiLeaks affair also doesn’t seem to have left much of an impression in his memory. In his address to Palantir engineers in July he sounded defiant: “We’ve never had a scandal that was really our fault.”

AT 4:07 P.M. ON NOV. 14, 2009 Michael Katz-Lacabe was parking his red Toyota Prius in the driveway of his home in the quiet Oakland suburb of San Leandro when a police car drove past. A license plate camera mounted on the squad car silently and routinely snapped a photo of the scene: his off-white, single-floor house, his wilted lawn and rosebushes, and his 5- and 8-year-old daughters jumping out of the car.

Katz-Lacabe, a gray-bearded and shaggy-haired member of the local school board, community activist and blogger, saw the photo only a year later: In 2010 he learned about the San Leandro Police Department’s automatic license plate readers, designed to constantly photograph and track the movements of every car in the city. He filed a public records request for any images that included either of his two cars. The police sent back 112 photos. He found the one of his children most disturbing.

“Who knows how many other people’s kids are captured in these images?” he asks. His concerns go beyond a mere sense of parental protection. “With this technology you can wind back the clock and see where everyone is, if they were parked at the house of someone other than their wife, a medical marijuana clinic, a Planned Parenthood center, a protest.”

... it’s clear that #Alex_Karp does indeed value privacy–his own.

His office, decorated with cardboard effigies of himself built by Palantir staff and a Lego fortress on a coffee table, overlooks Palo Alto’s Alma Street through two-way mirrors. Each pane is fitted with a wired device resembling a white hockey puck. The gadgets, known as acoustic transducers, imperceptibly vibrate the glass with white noise to prevent eavesdropping techniques, such as bouncing lasers off windows to listen to conversations inside.

He’s reminiscing about a more carefree time in his life–years before Palantir–and has put down his Rubik’s cube to better gesticulate. “I had $40,000 in the bank, and no one knew who I was. I loved it. I loved it. I just loved it. I just loved it!” he says, his voice rising and his hands waving above his head. “I would walk around, go into skanky places in Berlin all night. I’d talk to whoever would talk to me, occasionally go home with people, as often as I could. I went to places where people were doing things, smoking things. I just loved it.”

“One of the things I find really hard and view as a massive drag … is that I’m losing my ability to be completely anonymous.”

It’s not easy for a man in Karp’s position to be a deviant in the modern world. And with tools like Palantir in the hands of the government, deviance may not be easy for the rest of us, either. With or without safeguards, the “complete anonymity” Karp savors may be a 20th-century luxury.

Karp lowers his arms, and the enthusiasm drains from his voice: “I have to get over this.”

#surveillance, the preservation of #privacy regarded as a #luxury

December 05 2012

Four short links: 5 December 2012

  1. The Benefits of Poetry for Professionals (HBR) — Harman Industries founder Sidney Harman once told The New York Times, “I used to tell my senior staff to get me poets as managers. Poets are our original systems thinkers. They look at our most complex environments and they reduce the complexity to something they begin to understand.”
  2. First Few Milliseconds of an HTTPS Connection — far more than you ever wanted to know about how HTTPS connections are initiated.
  3. Google Earth Engine — Develop, access and run algorithms on the full Earth Engine data archive, all using Google’s parallel processing platform. (via Nelson Minar)
  4. 3D Printing Popup Store Opens in NYC (Makezine Blog) — MAKE has partnered with 3DEA, a pop up 3D printing emporium in New York City’s fashion district. The store will sell printers and 3D printed objects as well as offer a lineup of classes, workshops, and presentations from the likes of jewelry maker Kevin Wei, 3D printing artist Josh Harker, and Shapeways’ Duann Scott. This. is. awesome!

October 04 2012

Four short links: 4 October 2012

  1. As We May Think (Vannevar Bush) — incredibly prescient piece he wrote for The Atlantic in 1945.
  2. Transparency and Topic Models (YouTube) — a talk from DataGotham 2012, by Hanna Wallach. She uses latent Dirichlet allocation topic models to mine text data in declassified documents where the metadata are useless. She’s working on predicting classification durations (AWESOME!). (via Matt Biddulph)
  3. Slippy Map of the Ancient World — this. is. so. cool!
  4. Technology in the NFL — X2IMPACT’s Concussion Management System (CMS) is a great example of this trend. CMS, when combined with a digital mouth guard, also made by X2, enables coaches to see head impact data in real time and assess concussions through monitoring the accelerometers in a player’s mouth guard. That data helps teams to decide whether to keep a player on the field or take them off for their own safety. Insert referee joke here.

August 25 2012

Four short links: 30 August 2012

  1. TOS;DR — terms of service rendered comprehensible. “Make the hard stuff easy” is a great template for good ideas, and this just nails it.
  2. Sick of Impact Factors — typically only 15% of the papers in a journal account for half the total citations. Therefore only this minority of the articles has more than the average number of citations denoted by the journal impact factor. Take a moment to think about what that means: the vast majority of the journal’s papers — fully 85% — have fewer citations than the average. The impact factor is a statistically indefensible indicator of journal performance; it flatters to deceive, distributing credit that has been earned by only a small fraction of its published papers. (via Sci Blogs)
  3. A Generation Lost in the Bazaar (ACM) — Today’s Unix/Posix-like operating systems, even including IBM’s z/OS mainframe version, as seen with 1980 eyes are identical; yet the 31,085 lines of configure for libtool still check if <sys/stat.h> and <stdlib.h> exist, even though the Unixen, which lacked them, had neither sufficient memory to execute libtool nor disks big enough for its 16-MB source code. [...] That is the sorry reality of the bazaar Raymond praised in his book: a pile of old festering hacks, endlessly copied and pasted by a clueless generation of IT “professionals” who wouldn’t recognize sound IT architecture if you hit them over the head with it. It is hard to believe today, but under this embarrassing mess lies the ruins of the beautiful cathedral of Unix, deservedly famous for its simplicity of design, its economy of features, and its elegance of execution. (Sic transit gloria mundi, etc.)
  4. History as Science (Nature) — Turchin and his allies contend that the time is ripe to revisit general laws, thanks to tools such as nonlinear mathematics, simulations that can model the interactions of thousands or millions of individuals at once, and informatics technologies for gathering and analysing huge databases of historical information.

August 23 2012

Four short links: 23 August 2012

  1. Computational Social Science (Nature) — Facebook and Twitter data drives social science analysis. (via Vaughan Bell)
  2. The Single Most Important Object in the Global Economy (Slate) — Companies like Ikea have literally designed products around pallets: Its “Bang” mug, notes Colin White in his book Strategic Management, has had three redesigns, each done not for aesthetics but to ensure that more mugs would fit on a pallet (not to mention in a customer’s cupboard). (via Boing Boing)
  3. Narco Ultralights (Wired) — it’s just a matter of time until there are no humans on the ultralights. Remote-controlled narcodrones can’t be far away.
  4. Shortcut Foo — a typing tutor for editors, Photoshop, and the command line, to build muscle memory of frequently-used keystrokes. Brilliant! (via Irene Ros)

August 15 2012

Mining the astronomical literature

There is a huge debate right now about making academic literature freely accessible and moving toward open access. But what would be possible if people stopped talking about it and just dug in and got on with it?

NASA’s Astrophysics Data System (ADS), hosted by the Smithsonian Astrophysical Observatory (SAO), has quietly been working away since the mid-’90s. Without much, if any, fanfare amongst the other disciplines, it has moved astronomers into a world where access to the literature is just a given. It’s something they don’t have to think about all that much.

The ADS service provides access to abstracts for virtually all of the astronomical literature. But it also provides access to the full text of more than half a million papers, going right back to the start of peer-reviewed journals in the 1800s. The service has links to online data archives, along with reference and citation information for each of the papers, and it’s all searchable and downloadable.

Number of papers published in the three main astronomy journals each year. CREDIT: Robert Simpson

The existence of the ADS, along with the arXiv pre-print server, has meant that most astronomers haven’t seen the inside of a brick-built library since the late 1990s.

It also makes astronomy almost uniquely well placed for interesting data mining experiments, experiments that hint at what the rest of academia could do if they followed astronomy’s lead. The fact that the discipline’s literature has been scanned, archived, indexed and catalogued, and placed behind a RESTful API makes it a treasure trove, both for hypothesis generation and sociological research.
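As a hint of what "behind a RESTful API" buys you, here is a sketch of querying the modern ADS search API from Python with the requests library. The endpoint, fields, and query syntax follow ADS's public documentation as I understand it; the token is a placeholder for a key tied to your own ADS account.

    # Sketch of querying the ADS search API with the requests library.
    # Endpoint and parameters follow ADS's public documentation; the token
    # is a placeholder for a key from your own ADS account.
    import requests

    ADS_TOKEN = "your-ads-api-token"  # placeholder

    resp = requests.get(
        "https://api.adsabs.harvard.edu/v1/search/query",
        headers={"Authorization": f"Bearer {ADS_TOKEN}"},
        params={
            "q": 'abs:"dark matter"',   # search abstracts for the phrase
            "fl": "year,title,citation_count",
            "rows": 5,
            "sort": "citation_count desc",
        },
    )
    resp.raise_for_status()
    for doc in resp.json()["response"]["docs"]:
        print(doc["year"], doc.get("citation_count"), doc["title"][0])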

For example, the .Astronomy series of conferences is a small workshop that brings together the best and the brightest of the technical community: researchers, developers, educators and communicators. Billed as “20% time for astronomers,” it gives these people space to think about how new technologies affect both how research is done and how it is communicated to their peers and to the public.

[Disclosure: I'm a member of the advisory board to the .Astronomy conference, and I previously served as a member of the programme organising committee for the conference series.]

It should perhaps come as little surprise that one of the more interesting projects to come out of a hack day held as part of this year’s .Astronomy meeting in Heidelberg was work by Robert Simpson, Karen Masters and Sarah Kendrew that focused on data mining the astronomical literature.

The team grabbed and processed the titles and abstracts of all the papers from the Astrophysical Journal (ApJ), Astronomy & Astrophysics (A&A), and the Monthly Notices of the Royal Astronomical Society (MNRAS) since each of those journals started publication — and that’s 1827 in the case of MNRAS.

By the end of the day, they’d found some interesting results showing how various terms have trended over time. The results were similar to what’s found in Google Books’ Ngram Viewer.
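The core of that kind of trend analysis is simple to sketch: for each year, count what fraction of abstracts mention a term. Here is a minimal version in Python; the `papers` list stands in for whatever records you have pulled down from ADS or arXiv.

    # Minimal sketch of an Ngram-style trend: the fraction of abstracts per
    # year that mention a given term. `papers` stands in for records pulled
    # from ADS or arXiv.
    from collections import defaultdict

    papers = [
        {"year": 1985, "abstract": "We study galaxy rotation curves and dark matter halos."},
        {"year": 1985, "abstract": "A new survey of variable stars."},
        {"year": 2005, "abstract": "Dark matter constraints from weak lensing."},
    ]

    def term_trend(papers, term):
        mentions = defaultdict(int)
        totals = defaultdict(int)
        for p in papers:
            totals[p["year"]] += 1
            if term.lower() in p["abstract"].lower():
                mentions[p["year"]] += 1
        return {year: mentions[year] / totals[year] for year in sorted(totals)}

    print(term_trend(papers, "dark matter"))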

The relative popularity of the names of telescopes in the literature. Hubble, Chandra and Spitzer seem to have taken turns in hogging the limelight, much as COBE, WMAP and Planck have each contributed to our knowledge of the cosmic microwave background in successive decades. References to Planck are still on the rise. CREDIT: Robert Simpson.

Since the meeting, Robert has taken his initial results further and kept exploring his new corpus of data on the literature. He’s explored various visualisations of the data, including word matrices for related terms and for various astro-chemistry terms.

Correlation between terms related to Active Galactic Nuclei (AGN). The opacity of each square represents the strength of the correlation between the terms. CREDIT: Robert Simpson.

He’s also taken a look at authorship in astronomy and is starting to find some interesting trends.

Fraction of astronomical papers published with one, two, three, four or more authors. CREDIT: Robert Simpson

You can see that single-author papers dominated for most of the 20th century. Around 1960, we see the decline begin, as two- and three-author papers begin to become a significant chunk of the whole. In 1978, multi-author papers become more prevalent than single-author papers.

Compare the number of “active” research astronomers to the number of papers published each year (across all the major journals). CREDIT: Robert Simpson.

Here we see that people begin to outpace papers in the 1960s. This may reflect the fact that as we get more technical as a field, and more specialised, it takes more people to write the same number of papers, which is a sort of interesting result all by itself.

Interview with Robert Simpson: Behind the project and what lies ahead

I recently talked with Rob about the work he, Karen Masters, and Sarah Kendrew did at the meeting, and the work he’s been doing since with the newly gathered data.

What made you think about data mining the ADS?

Robert Simpson: At the .Astronomy 4 Hack Day in July, Sarah Kendrew had the idea to try to do an astronomy version of BrainSCANr, a project that generates new hypotheses in the neuroscience literature. I’ve had a go at mining ADS and arXiv before, so it seemed like a great excuse to dive back in.

Do you think there might be actual science that could be done here?

Robert Simpson: Yes, in the form of finding questions that were unexpected. With such large volumes of peer-reviewed papers being produced daily in astronomy, there is a lot being said. Most researchers can only try to keep up with it all — my daily RSS feed from arXiv is next to useless, it’s so bloated. In amongst all that text, there must be connections and relationships that are being missed by the community at large, hidden in the chatter. Maybe we can develop simple techniques to highlight potential missed links, i.e. generate new hypotheses from the mass of words and data.

Are the results coming out of the work useful for auditing academics?

Robert Simpson: Well, perhaps, but that would be tricky territory in my opinion. I’ve only just begun to explore the data around authorship in astronomy. One thing that is clear is that we can see a big trend toward collaborative work. In 2012, only 6% of papers were single-author efforts, compared with 70+% in the 1950s.

The average number of authors per paper since 1827. CREDIT: Robert Simpson.

We can measure how large groups are becoming, and who is part of which groups. In that sense, we can audit research groups, and maybe individual people. The big issue is keeping track of people through variations in their names and affiliations. Identifying authors is probably a solved problem if we look at ORCID.

What about citations? Can you draw any comparisons with h-index data?

Robert Simpson: I haven’t looked at h-index stuff specifically, at least not yet, but citations are fun. I looked at the trends surrounding the term “dark matter” and saw something interesting. Mentions of dark matter rise steadily after it first appears in the late ’70s.

Compare the term “dark matter” with a few other related terms: “cosmology,” “big bang,” “dark energy,” and “wmap.” You can see cosmology has been getting more popular since the 1990s, and dark energy is a recent addition. CREDIT: Robert Simpson.

In the data, astronomy becomes more and more obsessed with dark matter — the term appears in 1% of all papers by the end of the ’80s and 6% today.

Looking at citations changes the picture. The community is writing papers about dark matter more and more each year, but they are getting fewer citations than they used to (the peak for this was in the late ’90s). These trends are normalised, so the only recency effect I can think of is that dark matter papers take more than 10 years to become citable. Either that or dark matter studies are currently in a trough for impact.

Can you see where work is dropped by parts of the community and picked up again?

Robert Simpson: Not yet, but I see what you mean. I need to build a better picture of the community and its components.

Can you build a social graph of astronomers out of this data? What about (academic) family trees?

Robert Simpson: Identifying unique authors is my next step, followed by creating fingerprints of individuals at a given point in time. When do people create their first-author papers, when do they have the most impact in their careers, stuff like that.

What tools did you use? In hindsight, would you do it differently?

Robert Simpson: I’m using Ruby and Perl to grab the data, MySQL to store and query it, JavaScript to display it (Google Charts and D3.js). I may still move the database part to MongoDB because it was designed to store documents. Similarly, I may switch from ADS to arXiv as the data source. Using arXiv would allow me to grab the full text in many cases, even if it does introduce a peer-review issue.

What’s next?

Robert Simpson: My aim is still to attempt real hypothesis generation. I’ve begun the process by investigating correlations between terms in the literature, but I think the power will be in being able to compare all terms with all terms and looking for the unexpected. Terms may correlate indirectly (via a third term, for example), so the entire corpus needs to be processed and optimised to make it work comprehensively.
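A rough sketch of that all-terms-against-all-terms idea: build a per-paper term-occurrence matrix and correlate its columns, so that unexpectedly strong off-diagonal entries become candidate hypotheses. The terms and abstracts below are placeholders, and a real run would process the whole corpus rather than a handful of strings.

    # Rough sketch of all-terms-versus-all-terms correlation: build a
    # document/term occurrence matrix and correlate its columns. Strong,
    # unexpected off-diagonal values are the candidate "missed links".
    # Terms and abstracts are placeholders.
    import numpy as np

    terms = ["dark matter", "lensing", "exoplanet"]
    abstracts = [
        "dark matter constraints from weak lensing surveys",
        "strong lensing by massive clusters traces dark matter",
        "transit photometry of a new exoplanet",
    ]

    # occurrence[i, j] = 1 if term j appears in abstract i
    occurrence = np.array(
        [[int(t in a) for t in terms] for a in abstracts], dtype=float
    )

    corr = np.corrcoef(occurrence, rowvar=False)  # term-by-term correlations
    for i, ti in enumerate(terms):
        for j, tj in enumerate(terms):
            if i < j:
                print(f"{ti!r} vs {tj!r}: {corr[i, j]:+.2f}")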

Science between the cracks

I’m really looking forward to seeing more results coming out of Robert’s work. This sort of analysis hasn’t really been possible before. It’s showing a lot of promise both from a sociological angle, with the ability to do research into how science is done and how that has changed, but also ultimately as a hypothesis engine — something that can generate new science in and of itself. This is just a hack day experiment. Imagine what could be done if the literature were more open and this sort of analysis could be done across fields?

Right now, a lot of the most interesting science is being done in the cracks between disciplines, but the hardest part of that sort of work is often trying to understand the literature of the discipline that isn’t your own. Robert’s project offers a lot of hope that this may soon become easier.
