

August 21 2013


Free #gis Datasets - Categorised List
http://freegisdata.rtwilson.com

This page contains a categorised list of links to over 300 sites providing freely available geographic datasets - all ready for loading into a Geographic Information System.

#qgis #shapefile #data


August 16 2013


The vanishing cost of guessing

http://radar.oreilly.com/2013/08/the-vanishing-cost-of-guessing.html

[C]ritics charge that big data will make us stick to constantly optimizing what we already know, rather than thinking out of the box and truly innovating. We’ll rely on machines for evolutionary improvements, rather than revolutionary disruption. An abundance of data means we can find facts to support our preconceived notions, polarizing us politically and dividing us into “filter bubbles” of like-minded intolerance. And it’s easy to mistake correlation for causality, leading us to deny someone medical coverage or refuse them employment because of a pattern over which they have no control, taking us back to the racism and injustice of Apartheid or Redlining.

// via soup.io, where the complete article is available

#données - #prédiction #prévention #anticipation #corrélation - #coût #préjugé

#Daten #Vorhersage #Vorsorge #Analyse #Zusammenhang - #Kosten #Vorurteil - #gelenkte #Wahrnehmung

#big_data #data - #prediction #preventing #analysis - #cost #prejudice #bias #preconception

August 12 2013


Simple, fast map data editing | MapBox
http://www.mapbox.com/blog/geojsonio-announce

We are trying to make it easier to draw, change, and publish #maps. Some of the most important geospatial data is the information we know, observe, and can draw on a napkin. This is the kind of #data that we also like to collaborate on, like collecting bars that have free wifi ✎ or favorite running routes.

http://geojson.io aims to fix that. It’s an open source project built with #MapBox.js, #GitHub's powerful new Gist and #GeoJSON features, and an array of microlibraries that power import, export, editing, and lots more.
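As a concrete illustration of the kind of data geojson.io edits, here is a minimal GeoJSON FeatureCollection built in Python. The location, name, and `free_wifi` property are invented, following the wifi-bars example above:

```python
import json

# A hypothetical GeoJSON Feature: one bar with free wifi.
# GeoJSON coordinates are [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-73.99, 40.73],  # made-up location
    },
    "properties": {"name": "Example Bar", "free_wifi": True},
}

# A FeatureCollection is the usual top-level container geojson.io works with.
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection, indent=2))
```

Pasting JSON like this into geojson.io (or saving it as a Gist) is enough to see it rendered on a map.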

#données #cartographie #interface

August 10 2013

Four short links: 12 August 2013

  1. List of Malware pcaps and Samples — Currently, most of the samples described have the corresponding samples and pcaps available for download.
  2. InterTwinkles — open source platform built from the ground up to help small democratic groups to do process online. It provides structure to improve the efficiency of specific communication tasks like brainstorming and proposals. (via Willow Bl00)
  3. Lavabit, Privacy, Seppuku, and Game Theory (Vikram Kumar) — Mega’s CEO’s private blog, musing about rational responses to malstates.
  4. Telepath Keylogger (Github) — A happy Mac keylogger for Quantified Self purposes. (via Nick Winter)

August 03 2013


TANAGRA - Data mining, statistics, and data analysis software for teaching and research
http://eric.univ-lyon2.fr/~ricco/tanagra/fr/tanagra.html #data #datamining

August 01 2013

Four short links: 2 August 2013

  1. Unhappy Truckers and Other Algorithmic Problems — Even the insides of vans are subjected to a kind of routing algorithm; the next time you get a package, look for a three-letter code, like “RDL.” That means “rear door left,” and it is so the driver has to take as few steps as possible to locate the package. (via Sam Minnee)
  2. Fuel3D: A Sub-$1000 3D Scanner (Kickstarter) — a point-and-shoot 3D imaging system that captures extremely high resolution mesh and color information of objects. Fuel3D is the world’s first 3D scanner to combine pre-calibrated stereo cameras with photometric imaging to capture and process files in seconds.
  3. Corporate Open Source Anti-Patterns (YouTube) — Brian Cantrill’s talk, slides here. (via Daniel Bachhuber)
  4. Hacking for Humanity (The Economist) — Getting PhDs and data specialists to donate their skills to charities is the idea behind the event’s organizer, DataKind UK, an offshoot of the American nonprofit group.

July 29 2013


Bad immigration policy doesn’t spring from bad stats

It’s not our immigration data – or the perceived lack of it – that’s the problem; in fact, we know far more than we did a decade ago.

“Disney World can keep better track of its visitors than Britain” was the Mail on Sunday’s summary of the public administration committee’s report on migration statistics. That’s probably true. But it’s hardly surprising; Disney World’s task is, for obvious reasons, rather easier. In particular, Disney doesn’t have to deal with the fact that well over 500 million people are free to come to the UK whenever they like, and stay for as long as they want.

http://www.guardian.co.uk/commentisfree/2013/jul/29/bad-immigration-policy-stats-data

#migration #politique_migratoire #statistiques #data #données

July 27 2013


PLOS ONE: if we share data, who will use it?
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0067332

A research article on the sharing of scientific data, a practice that is still uncommon and takes place mostly through interpersonal exchanges. Tags: internetactu internetactu2net fing #science #Data


July 23 2013

Interactive map: bike movements in New York City and Washington, D.C.

From midnight to 7:30 A.M., New York is uncharacteristically quiet, its Citibikes, the city’s new shared bicycles, largely stationary and clustered in residential neighborhoods. Then things begin to move: commuters check out the bikes en masse in residential areas across Manhattan and, over the next two hours, relocate them to Midtown, the Flatiron district, SoHo, and Wall Street. There they remain concentrated, mostly used for local trips, until they start to move back outward around 5 P.M.

Washington, D.C.’s bike-share program exhibits a similar pattern, though, as you’d expect, the movement starts a little earlier in the morning. On my animated map, both cities look like they’re breathing: inhaling and then exhaling once over the course of 12 hours or so.

The map below shows availability at bike stations in New York City and the Washington, D.C. area across the course of the day. Solid blue dots represent completely-full bike stations; white dots indicate empty bike stations. Click on any station to see a graph of average availability over time. I’ve written a few thoughts on what this means about the program below the graphic.

We can see some interesting patterns in the bike share data here. First of all, use of bikes for commuting is evidently highest in the residential areas immediately adjacent to dense commercial areas. That makes sense; a bike commute from the East Village to Union Square is extremely easy, and that’s also the sort of trip that tends to be surprisingly difficult by subway. The more remote bike stations in Brooklyn and Arlington exhibit fairly flat availability profiles over the course of the day, suggesting that to the degree they’re used at all, it’s mostly for local trips.

A bit about the map: I built this by scraping the data feeds that underlie the New York and Washington real-time availability maps every 10 minutes and storing them in a database. (Here is New York’s feed; here is Washington’s.) I averaged availability by station in 10-minute increments over seven weekdays of collected data. The map uses JavaScript (mostly jQuery) to manipulate an SVG image, changing the opacity of bike-share stations to represent availability and rendering a graph every time a station is clicked. I used Python and MySQL for the back-end work of collecting the data, aggregating it, and publishing it to a JSON file that the front-end code downloads and parses.
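The aggregation step described above (averaging availability by station in 10-minute slots, then publishing JSON for the front end) can be sketched roughly as follows. The station names and sample values are invented, and the real pipeline stored its snapshots in MySQL rather than in-memory lists:

```python
import json
from collections import defaultdict
from statistics import mean

# Hypothetical raw samples: (station_id, minutes past midnight, bikes available).
# In the real pipeline these would come from the scraped feed, one snapshot
# every 10 minutes over seven weekdays.
samples = [
    ("station-1", 0, 10), ("station-1", 0, 12),   # same slot, two weekdays
    ("station-1", 10, 8), ("station-1", 10, 6),
    ("station-2", 0, 3),  ("station-2", 10, 5),
]

# Group observations by (station, 10-minute slot).
buckets = defaultdict(list)
for station, slot, bikes in samples:
    buckets[(station, slot)].append(bikes)

# Average each bucket to get a per-station availability profile.
profile = defaultdict(dict)
for (station, slot), values in buckets.items():
    profile[station][slot] = mean(values)

# Publish the profile as the JSON file the front-end map would download.
print(json.dumps(profile, indent=2))
```

Note that `json.dumps` coerces the integer slot keys to strings, which is what a JavaScript front end would see after parsing anyway.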

This map, by the way, is an extremely simple example of what’s possible when the physical world is instrumented and programmable. I’ve written about sensor-laden machinery in my research report on the industrial internet, and we plan to continue our work on the programmable world in the coming months.

July 10 2013


Global migration and debt

http://flowingdata.com/2013/07/09/global-migration-and-debt

This is not bad at all.

Global Economic Dynamics, in collaboration with 9elements, provides an explorer that shows country relationships through migration and debt. Inspired by a New York Times graphic from a few years ago, which was a static look at debt, the GED interactive allows you to select among 46 countries and browse data from 2000 through 2010.

http://flowingdata.com/wp-content/uploads/2013/07/GED-Viz-625x521.png

Each outer bar represents a country, and each connecting line indicates either migration between two countries or bank claims, depending on which you choose to look at. You can also select several country indicators, which are represented with bubbles. (The image above shows GDP.) That part of the visualization is tough to read with multiple indicators and countries, though.

#data #statistiques #visualisation #dette #agglomérations #villes

July 08 2013


Email exchanges to be stored for a year

It was our colleagues at the Standaard who broke the news this morning: the government has just submitted to Parliament a bill requiring telecom providers (Belgacom, Telenet…) to store, for one year, all traces of the communications passing through their servers. Mandated by a European directive, this obligation covers both records of telephone communications (calls and text messages) and email exchanges. The aim is to help the justice system and state security in the fight against serious crime. (...) What is new, however, concerns email exchanges: operators will have to record for 12 months the IP addresses (the number unique to each computer) from which electronic messages are sent or at which they arrive. (...)

http://www.lesoir.be/276922/article/actualite/belgique/2013-07-08/echanges-mails-stockes-un-an
#belgique #données #data #privacy

June 11 2013

Four short links: 11 June 2013

  1. For Example — amazing discussion of 3D visualization techniques, full of examples using the D3.js library and bl.ocks.org example gist system. Gorgeous and informative.
  2. Anti-Gravity 3D Printer — uses strands to sculpt on any surface. (via Slashdot)
  3. How 3D Printing Will Rebuild Reality (BoingBoing) — But even though home 3D-printing has received substantial publicity of late, it is in the industrial sector where the technology will probably make its most significant near-term impact on the world both by manufacturing improved commercial products and by stimulating industry to develop next-generation fab methods and machines that could one day truly bring 3D-printing home to users in a real way.
  4. The Emotional Side of Big Data — Personal Democracy Forum 2013 talk by Sara Critchfield, on reframing emotion as data for decision-making. (via Quartz)

June 03 2013

Big data vs. big reality

This post originally appeared on Cumulus Partners. It’s republished with permission.

Quentin Hardy’s recent post in the Bits blog of The New York Times touched on the gap between representation and reality that is a core element of practically every human enterprise. His post is titled “Why Big Data is Not Truth,” and I recommend it for anyone who feels like joining the phony argument over whether “big data” represents reality better than traditional data.

In a nutshell, this “us” versus “them” approach is like trying to pick a fight between oil painters and water colorists. Neither oil painting nor water colors are “truth”; both are forms of representation. And here’s the important part: Representation is exactly that — a representation or interpretation of someone’s perceived reality. Pitting “big data” against traditional data is like asking you if Rembrandt is more “real” than Gainsborough. Both of them are artists and both painted representations of the world they perceived around them.

The problem with false arguments like the one posed by Hardy is that they obscure the value of data — traditional data and big data — and the impact of data on our culture. I’m now working my way through “Raw Data” is an Oxymoron, an anthology of short essays about data. I recommend it for anyone who is seriously interested in thinking about the many ways in which data has influenced (and continues influencing) our lives. I especially recommend “facts and FACTS: Abolitionists’ Database Innovations,” by Ellen Gruber Garvey. As its title suggests, the essay focuses on what proves to be an absolutely fascinating period of U.S. history in which the anti-slavery movement harvested data from real advertisements in Southern newspapers to paint a vivid and believable picture of the routine horrors inflicted by the slave system on real human beings.

That 19th century use of data mining built support for the anti-slavery movement, both in the U.S. and in England. The data played a key role in making the case for abolishing slavery — even though it required the bloodiest war in U.S. history to make abolition a fact.

Data itself has no quality. It’s what you do with it that counts.

May 28 2013

Four short links: 28 May 2013

  1. My Little Geek — children’s primer with a geeky bent. A is for Android, B is for Binary, C is for Caffeine …. They have a Kickstarter for two sequels: numbers and shapes.
  2. Visible CSS Rules — Enter a URL to see how the CSS rules interact with that page.
  3. How to Work Remotely — none of this is rocket science, it’s all true and things we had to learn the hard way.
  4. Raspberry Pi Twitter Sentiment Server — step-by-step guide, and github repo for the lazy. (via Jason Bell)

May 22 2013

Four short links: 22 May 2013

  1. XBox One Kinect Controller (Guardian) — the new Kinect controller can detect gaze, heartbeat, and the buttons on your shirt.
  2. Surveillance and the Internet of Things (Bruce Schneier) — Lots has been written about the “Internet of Things” and how it will change society for the better. It’s true that it will make a lot of wonderful things possible, but the “Internet of Things” will also allow for an even greater amount of surveillance than there is today. The Internet of Things gives the governments and corporations that follow our every move something they don’t yet have: eyes and ears.
  3. Daniel Dennett’s Intuition Pumps (extract) — How to compose a successful critical commentary: 1. Attempt to re-express your target’s position so clearly, vividly and fairly that your target says: “Thanks, I wish I’d thought of putting it that way.” 2. List any points of agreement (especially if they are not matters of general or widespread agreement). 3. Mention anything you have learned from your target. 4. Only then are you permitted to say so much as a word of rebuttal or criticism.
  4. New Data Science Toolkit Out (Pete Warden) — with population data to let you compensate for population in your heatmaps. No more “gosh, EVERYTHING is more prevalent where there are lots of people!” meaningless charts.
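The population compensation mentioned in item 4 boils down to converting raw counts into per-capita rates before plotting them. A minimal sketch, with all city names and numbers invented:

```python
# Raw event counts mostly track population; dividing by population exposes
# the actual rate. All figures here are invented for illustration.
events = {"Springfield": 500, "Smallville": 50}
population = {"Springfield": 1_000_000, "Smallville": 20_000}

# Events per 100,000 residents, the usual normalization for heatmaps.
rate_per_100k = {
    city: 100_000 * events[city] / population[city]
    for city in events
}

# Smallville's raw count is 10x smaller, yet its per-capita rate is higher,
# which is exactly the signal a raw-count heatmap would hide.
for city, rate in rate_per_100k.items():
    print(f"{city}: {rate:.1f} per 100k")
```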

May 14 2013

Four short links: 14 May 2013

  1. Behind the Banner — visualization of what happens in the 150ms when the cabal of data vultures decide which ad to show you. They pass around your data as enthusiastically as a pipe at a Grateful Dead concert, and you’ve just as much chance of getting it back. (via John Battelle)
  2. pwnpad — Nexus 7 with Android and Ubuntu, high-gain USB bluetooth, ethernet adapter, and a gorgeous suite of security tools. (via Kyle Young)
  3. Terra — a simple, statically-typed, compiled language with manual memory management [...] designed from the beginning to interoperate with Lua. Terra functions are first-class Lua values created using the terra keyword. When needed they are JIT-compiled to machine code. (via Hacker News)
  4. Metaphor Identification in Large Texts Corpora (PLOSone) — The paper presents the most comprehensive study of metaphor identification in terms of scope of metaphorical phrases and annotated corpora size. Algorithms’ performance in identifying linguistic phrases as metaphorical or literal has been compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus.

May 13 2013

Big data, cool kids


My data’s bigger than yours!

The big data world is a confusing place. We’re no longer in a market dominated mostly by relational databases, and the alternatives have multiplied in a baby boom of diversity.

These child prodigies of the data scene show great promise but spend a lot of time knocking each other around in the schoolyard. Their egos can sometimes be too big to accept that everybody has their place, and eyeball-seeking media certainly doesn’t help.

POPULAR KID: Look at me! Big data is the hotness!
HADOOP: My data’s bigger than yours!
SCIPY: Size isn’t everything, Hadoop! The bigger they come, the harder they fall. And aren’t you named after a toy elephant?
R: Backward sentences mine be, but great power contains large brain.
EVERYONE: Huh?
SQL: Oh, so you all want to be friends again now, eh?!
POPULAR KID: Yeah, what SQL said! Nobody really needs big data; it’s all about small data, dummy.

The fact is that we’re fumbling toward the adolescence of big data tools, and we’re at an early stage of understanding how data can be used to create value and increase the quality of service people receive from government, business and health care. Big data is trumpeted in mainstream media, but many businesses are better advised to take baby steps with small data.

Data skeptics are not without justification. Our use of “small data” hasn’t exactly worked out uniformly well so far, crude numbers often being misused either knowingly or otherwise. For example, over-reliance by bureaucrats on the results of testing in schools is shaping educational institutions toward a tragically homogeneous mediocrity.

The promise and the gamble of big data is this: that we can advance past the primitive quotas of today’s small data into both a sophisticated statistical understanding of an entire system and insight that focuses down to the level of an individual. Data gives us both telescope and microscope, in detail we’ve never had before.

Inside this tantalizing vision lies many of the debates in today’s data world: the need for highly skilled data scientists to effect this change, and the worry that we’ll inadvertently enslave ourselves to Big Brother, even with the best of intentions.

So, as the data revolution moves forward, it’s important to take the long view. The foment of tools and job titles and algorithms is significant, but ultimately it’s background to our larger purposes as people, businesses and government. That’s one reason why, at O’Reilly, we’ve taken the motto “Making Data Work” for Strata. Data, not technology, is the heartbeat of our world because it relates directly to ourselves and the problems we want to solve.

This is also the reason that the Strata and Hadoop World conferences take a broad view of the subject: ranging from the business topics to the tools and data science. If you talk to Hadoop’s most seasoned advocates, they don’t speak only about the tech; they talk about the problems they’re able to solve. The tools alone are never enough; the real enabler is the framework of people and understanding in which they’re used.

Our mission is to help people make sense of the state of the data world and use this knowledge to become both more competitive and more creative. We believe that’s best served by creating context in which we think about our use of data as well as serving the growing specialist communities in data.

Enjoy the noise and the energy from the growing data ecosystem, but keep your eyes on the problems you want to solve.

The Strata and Hadoop World Call for Proposals is open until midnight EDT, Thursday May 16.

May 06 2013

Another Serving of Data Skepticism

I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!

That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.

Let’s do some thought experiments – unfortunately, totally devoid of data. But I don’t think we need data to get to the core of the problem. Think of the classic false correlation (when teaching logic, also used as an example of a false syllogism): there’s a strong correlation between people who eat pickles and people who die. Well, yeah. We laugh. But let’s take this a step further: correlation is a double-headed arrow. So not only does this poor logic imply that we can reduce the death rate by preventing people from eating pickles, it also implies that we can harm the chemical companies that produce vinegar by preventing people from dying. And here we see what’s really happening: to remove one head of the double-headed arrow, we use “common sense” to choose between two stories: one that’s merely silly, and another that’s so ludicrous we never even think about it. Seems to work here (for a very limited value of “work”); but if I’ve learned one thing, it’s that good old common sense is frequently neither common nor sensible. For more realistic correlations, it certainly seems ironic that we’re doing all this data analysis just to end up relying on common sense.

Now let’s look at something equally hypothetical that isn’t silly. A drug is correlated with reduced risk of death due to heart failure. Good thing, right? Yes – but why? What if the drug has nothing to do with heart failure, but is really an anti-depressant that makes you feel better about yourself so you exercise more? If you’re in the “correlation is as good as causation” club, it doesn’t make a difference: you win either way. Except that, if the key is really exercise, there might be much better ways to achieve the same result. Certainly much cheaper, since the drug industry will no doubt price the pills at $100 each. (Tangent: I once saw a truck drive up to an orthopedist’s office and deliver Vioxx samples with a street value probably in the millions…) It’s possible, given some really interesting work being done on the placebo effect, that a properly administered sugar pill will make the patient feel better and exercise, yielding the same result. (Though it’s possible that sugar pills only work as placebos if they’re expensive.) I think we’d like to know, rather than just saying that correlation is just as good as causation if you have a lot of data.

Perhaps I haven’t gone far enough: with enough data, and enough dimensions to the data, it would be possible to detect the correlations between the drug, psychological state, exercise, and heart disease. But that’s not the point. First, if correlation really is as good as causation, why bother? Second, to analyze data, you have to collect it. And before you collect it, you have to decide what to collect. Data is socially constructed (I promise, this will be the subject of another post), and the data you don’t decide to collect doesn’t exist. Decisions about what data to collect are almost always driven by the stories we want to tell. You can have petabytes of data, but if it isn’t the right data, if it’s data that’s been biased by preconceived notions of what’s important, you’re going to be misled. Indeed, any researcher knows that huge data sets tend to create spurious correlations.

Causation has its own problems, not the least of which is that it’s impossible to prove. Unfortunately, that’s the way the world works. But thinking about cause and how events relate to each other helps us to be more critical about the correlations we discover. As humans we’re storytellers, and an important part of data work is building a story around the data. Mere correlations arising from a gigantic pool of data aren’t enough to satisfy us. But there are good stories and bad ones, and just as it’s possible to be careful in designing your experiments, it’s possible to be careful and ethical in the stories you tell with your data. Those stories may be the closest we ever get to an understanding of cause; but we have to realize that they’re just stories, that they’re provisional, and that better evidence (which may just be correlations) may force us to retell our stories at any moment. “Correlation is as good as causation” is just an excuse for intellectual sloppiness; it’s an excuse to replace thought with an odd kind of “common sense,” and to shut down the discussion that leads to good stories and understanding.
