
August 01 2012

Big data is our generation’s civil rights issue, and we don’t know it

Data doesn’t invade people’s lives. Lack of control over how it’s used does.

What’s really driving so-called big data isn’t the volume of information. It turns out big data doesn’t have to be all that big. Rather, it’s about a reconsideration of the fundamental economics of analyzing data.

For decades, there’s been a fundamental tension between three attributes of databases. You can have the data fast; you can have it big; or you can have it varied. The catch is, you can’t have all three at once.

The big data trifecta.

I’d first heard this as the “three V’s of data”: Volume, Variety, and Velocity. Traditionally, getting two was easy but getting three was very, very, very expensive.

The advent of clouds, platforms like Hadoop, and the inexorable march of Moore’s Law means that now, analyzing data is trivially inexpensive. And when things become so cheap that they’re practically free, big changes happen — just look at the advent of steam power, or the copying of digital music, or the rise of home printing. Abundance replaces scarcity, and we invent new business models.

In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. A traditional enterprise data warehouse might have tracked sales of widgets by color, region, and size. This act of deciding what to store and how to store it is called designing the schema, and in many ways, it’s the moment where someone decides what the data is about. It’s the instant of context.

That needs repeating:

You decide what data is about the moment you define its schema.

With the new, data-is-abundant model, we collect first and ask questions later. The schema comes after the collection. Indeed, big data success stories like Splunk, Palantir, and others are prized because of their ability to make sense of content well after it’s been collected — sometimes called a schema-less query. This means we collect information long before we decide what it’s for.
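The difference is easy to see in code. In the schema-first world you decide the columns before anything is stored, and whatever doesn't fit is lost; in the collect-first world the raw record is kept and interpreted at query time. A toy illustration (not any particular product's API):

```python
import json

# Schema-on-write: decide what the data is about before storing it.
# Anything that doesn't fit the widget schema is simply discarded.
schema_rows = [("red", "EU", "large")]  # (color, region, size)

# Schema-on-read: store the raw event, decide what it means later.
raw_events = [
    json.dumps({"color": "red", "region": "EU", "size": "large",
                "ip": "203.0.113.7", "referrer": "example.com"}),
]

# Months later, a question nobody anticipated at collection time
# can still be answered, because nothing was thrown away.
events = [json.loads(e) for e in raw_events]
print([e["referrer"] for e in events])  # → ['example.com']
```

The schema-first table could never answer the referrer question; the raw events can answer questions nobody thought to ask when the data was collected, which is precisely the point.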

And this is a dangerous thing.

When bank managers tried to restrict loans to residents of certain areas (known as redlining) Congress stepped in to stop it (with the Fair Housing Act of 1968). They were able to legislate against discrimination, making it illegal to change loan policy based on someone’s race.

Home Owners’ Loan Corporation map showing redlining of “hazardous” districts in 1936.

“Personalization” is another word for discrimination. We’re not discriminating if we tailor things to you based on what we know about you — right? That’s just better service.

In one case, American Express used purchase history to adjust credit limits based on where a customer shopped, despite his excellent credit history:

Johnson says his jaw dropped when he read one of the reasons American Express gave for lowering his credit limit: “Other customers who have used their card at establishments where you recently shopped have a poor repayment history with American Express.”

Some of the things white men liked in 2010, according to OKCupid.

We’re seeing the start of this slippery slope everywhere, from tailored credit-card limits like this one to car insurance based on driver profiles. In this regard, big data is a civil rights issue, but it’s one that society in general is ill-equipped to deal with.

We’re great at using taste to predict things about people. OKCupid’s 2010 blog post “The Real Stuff White People Like” showed just how easily we can use information to guess at race. It’s a real eye-opener (and the guys who wrote it didn’t include everything they learned — some of it was a bit too controversial). They simply looked at the words one group used which others didn’t often use. The result was a list of “trigger” words for a particular race or gender.
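The underlying technique is simple enough to sketch. Here is a toy Python version (the sample texts and the smoothed frequency-ratio scoring are my own illustration, not OKCupid's actual method):

```python
from collections import Counter

def distinctive_words(group_a_texts, group_b_texts, min_count=2):
    """Rank words group A uses far more often than group B.

    A toy version of the approach the OKCupid post describes:
    count word use per group, then rank by frequency ratio.
    """
    count_a = Counter(w for t in group_a_texts for w in t.lower().split())
    count_b = Counter(w for t in group_b_texts for w in t.lower().split())
    total_a = sum(count_a.values()) or 1
    total_b = sum(count_b.values()) or 1

    scores = {}
    for word, n in count_a.items():
        if n < min_count:
            continue  # ignore words too rare to be a reliable signal
        rate_a = n / total_a
        rate_b = (count_b.get(word, 0) + 1) / (total_b + 1)  # smoothed
        scores[word] = rate_a / rate_b
    return sorted(scores, key=scores.get, reverse=True)

a = ["I love camping and hiking", "camping trips every weekend"]
b = ["city nightlife and concerts", "concerts downtown every weekend"]
print(distinctive_words(a, b))  # → ['camping']
```

With real corpora the top of that ranked list is exactly the set of "trigger" words the post describes.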

Now run this backwards. If I know you like these things, or see you mention them in blog posts, on Facebook, or in tweets, then there’s a good chance I know your gender and your race, and maybe even your religion and your sexual orientation. And that I can personalize my marketing efforts towards you.

That makes it a civil rights issue.

If I collect information on the music you listen to, you might assume I will use that data in order to suggest new songs, or share it with your friends. But instead, I could use it to guess at your racial background. And then I could use that data to deny you a loan.

Want another example? Check out Private Data In Public Ways, something I wrote a few months ago after seeing a talk at Big Data London, which discusses how publicly available last name information can be used to generate racial boundary maps:

Screen from the Mapping London project.

This TED talk by Malte Spitz does a great job of explaining the challenges of tracking citizens today, and he speculates about whether the Berlin Wall would ever have come down if the Stasi had access to phone records in the way today’s governments do.

So how do we regulate the way data is used?

The only way to deal with this properly is to somehow link what the data is with how it can be used. I might, for example, say that my musical tastes should be used for song recommendation, but not for banking decisions.

Tying data to permissions can be done through encryption, which is slow, riddled with DRM, burdensome, hard to implement, and bad for innovation. Or it can be done through legislation, which has about as much chance of success as regulating spam: it feels great, but it’s damned hard to enforce.
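To make the idea concrete, here is a minimal sketch of what use-scoped data could look like in code. Everything here, the `TaggedRecord` type and the purpose strings alike, is a hypothetical illustration of the principle, not an existing system:

```python
from dataclasses import dataclass, field

@dataclass
class TaggedRecord:
    """A piece of data bound to the purposes it may be used for."""
    owner: str
    kind: str
    value: object
    allowed_uses: set = field(default_factory=set)

def use(record: TaggedRecord, purpose: str):
    """Release the value only if the stated purpose is permitted."""
    if purpose not in record.allowed_uses:
        raise PermissionError(
            f"{record.kind} data may not be used for {purpose}")
    return record.value

tastes = TaggedRecord("alice", "music_tastes", ["jazz", "folk"],
                      allowed_uses={"song_recommendation"})

print(use(tastes, "song_recommendation"))  # → ['jazz', 'folk']
try:
    use(tastes, "loan_decision")           # refused: not an allowed use
except PermissionError as err:
    print(err)
```

The hard part, of course, is not expressing the rule but enforcing it once the data has left your hands, which is why the encryption-versus-legislation tradeoff above matters.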

There are brilliant examples of how a quantified society can improve the way we live, love, work, and play. Big data helps detect disease outbreaks, improve how students learn, reveal political partisanship, and save hundreds of millions of dollars for commuters — to pick just four examples. These are benefits we simply can’t ignore as we try to survive on a planet bursting with people and shaken by climate and energy crises.

But governments need to balance reliance on data with checks and balances about how this reliance erodes privacy and creates civil and moral issues we haven’t thought through. It’s something that most of the electorate isn’t thinking about, and yet it affects every purchase they make.

This should be fun.

This post originally appeared on Solve for Interesting. This version has been lightly edited.

Strata Conference + Hadoop World — The O’Reilly Strata Conference, being held Oct. 23-25 in New York City, explores the changes brought to technology and business by big data, data science, and pervasive computing. This year, Strata has joined forces with Hadoop World.

Save 20% on registration with the code RADAR20


August 15 2011

Dominant form of journalism foretold by Reynolds Journalism Institute

I recently had a talk with Bill Densmore, a consulting fellow at the
Reynolds Journalism Institute, about a comprehensive proposal for
re-establishing news providers on an economically and socially
sustainable foundation. I was not entirely happy with the vision he
laid out, but I recognized right away that it would triumph and become
a global standard. This paper deserves a look from anyone interested
in publishing, social networking, or democratic discourse.

It's not normal for me to be so insistent on a trend. Before you think
I'm over-dramatizing the situation, consider the following trends:

  • Moves by news organizations to charge for content. Although many sites
    (including O'Reilly, through Safari Books Online) have been offering
    subscriptions for some time, the recent retreat from free content by
    the New York Times is sure to lead the rest of the industry to follow.

  • The increasing dependence most of us have on other people to choose
    what news we view, whether through aggregation feeds or friends on
    social networks.

  • The trend (which I currently find to be a gimmick) in newspapers to
    include excerpts from blogs by non-journalists.

  • Strains on the ability of advertising to generate revenue, and the
    ever-intensifying determination of sites to gather more information
    about viewers in order to target them more effectively.

  • Greater consolidation in traditional media, driven by the realization
    that no site can go it alone.

  • The spread of federated single sign-on, mostly through OpenID, egged
    on by the US government in its 2010 National Strategy for Trusted
    Identities in Cyberspace. (Alex Howard covered the proposal on Radar.)

Consider these, and compare Densmore's proposal. He has found the way
for news institutions to capitalize on what's already happening, and
the result is a formula they're going to like. Of course, the final
formula may be somewhat different from his, and perhaps no media mogul
will ever stand up before a crowd of thousands and say, "We're using
the RJI model for content aggregation." But in its outlines, we see
here the future of news.

And that means we had better look at the possible negative effects of
the change and insist that the solution prevent them.

News the RJI way

Densmore's paper is quite long, because he needs to explain the
business end of his proposal to people who don't understand the
economics of current publishing, and the architectural end to people
unfamiliar with the underlying technology. He resorts to invented
terms, which I'll try to avoid as I offer a summary that I hope will
be enough to push forward discussion.

The content aggregation and pricing layer

The change to the news will be driven by institutions that already
have a news brand and can draw a loyal following. Some may be new,
others may be familiar brands such as the New York Times, and some may
be local news outlets with specialized content that a small audience
values highly. These news institutions will create more enticing
landing places by licensing content from other news providers around
the world.

Those who have followed online publishing closely may remember the
stillborn Automated Content Access Protocol (ACAP), which I critiqued
four years ago. I think ACAP is back, but not as a hitchhiker on
standard search techniques. The RJI proposal echoes its basic strategy
for finding and licensing content.

Incidentally, Densmore dislikes the term "micropayments" because it
seems to devalue content, whereas his proposal does the opposite. He
also dislikes the term "paywall" because it represents a critique of
subscription models that have worked fine since the beginning of
publishing.

How the user experiences the news

The innovation in Densmore's proposal is the use of federated single
sign-on to let readers leap from one site to another. The content is
not just republished and rebranded on the aggregator's site, but is
linked to directly. That means the original site can promote its own
brand, add updates, and incorporate reader comments.

Currently, if you want to read articles on two different subscription
sites, you need to sign up and pay at each. Densmore calls this a
broken user interface. Busy people want the content fed to them, even
if it requires payment, and want to just pay a monthly fee and forget
about it. The federated identity model allows this. The source of each
article can charge a small amount, such as half a cent, for each view,
and the aggregator can worry about harmonizing the economics of
subscriptions and license fees.
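The arithmetic of that model is easy to sketch. This toy Python ledger (the site names and monthly fee are illustrative; the half-cent figure comes from the paragraph above) tracks what an aggregator owes each source out of a reader's flat monthly payment:

```python
from collections import defaultdict

MILLS_PER_VIEW = 5  # half a cent per article view, in tenths of a cent

class AggregatorLedger:
    """Toy model: the reader pays one monthly fee; the aggregator
    settles per-view micro-charges with each source site and keeps
    the difference. Integer mills avoid floating-point drift."""

    def __init__(self, monthly_fee_dollars):
        self.monthly_fee_mills = int(monthly_fee_dollars * 1000)
        self.owed = defaultdict(int)  # source -> mills owed

    def record_view(self, source):
        self.owed[source] += MILLS_PER_VIEW

    def owed_dollars(self, source):
        return self.owed[source] / 1000

    def margin_dollars(self):
        return (self.monthly_fee_mills - sum(self.owed.values())) / 1000

ledger = AggregatorLedger(monthly_fee_dollars=10.00)
for _ in range(400):
    ledger.record_view("bigpaper.example")
for _ in range(200):
    ledger.record_view("localnews.example")

print(ledger.owed_dollars("bigpaper.example"))  # → 2.0
print(ledger.margin_dollars())                  # → 7.0
```

At half a cent a view, a $10 monthly fee covers 2,000 article views before the aggregator starts losing money, which is exactly the kind of harmonizing Densmore leaves to the aggregator.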

Personalization will become the crucial advantage of news aggregators.
Netizens' current strategies for spiking their news with variety
(adding sites to RSS feeds, scanning streams on Twitter or Google
Plus, and so forth) are ad hoc and produce inconsistent results. These
amateur and crowdsourced techniques will become professionalized. A
great deal of the value of a subscription, Densmore is betting, will
lie in the choices made by aggregators for the front page each of us
sees upon visiting their site each morning.

We will have many personas for news sites. I may use one persona for
doing technical research at work, another for checking on
international political events on the weekend, and a third for
following my favorite musicians.

TOC Frankfurt 2011 — Being held on Tuesday, Oct. 11, 2011, TOC Frankfurt will feature a full day of cutting-edge keynotes and panel discussions by key figures in the worlds of publishing and technology.

Save 100€ off the regular admission price with code TOC2011OR

The role of advertising

Few news sites will be able to break even through subscriptions alone.
Ads, as you may have guessed by now, will grab an accelerated ride on
the jetstream of personalization as well. Like it or not, we
will be known by our news providers, and hence indirectly by
those who want to sell us goods and services.

In fact, there's little about Densmore's proposal that is unique to
the news or publishing industries. It provides a general platform for
servers to learn about visitors and capitalize on that knowledge.
Publishers may need the platform more urgently than other industries
because what we offer is under more pressure from free and
crowdsourced alternatives, but the principles will appeal to anyone
who wants a deeper relationship with a visitor than a one-time click.

Densmore is learning from Google here. The techniques he advises news
sites to use are stepped-up versions of those that made Google a
200-billion-dollar company. He adds what he calls "advisor-tising,"
which involves some opt-in from visitors. If Densmore prevails, people
who visit a search engine by default to look up the news will instead
go to a news site of their choice.

Heading off dangers

The RJI model may appeal to consumers because it could help them get
more high-quality news with less effort. But what consumers think is
ultimately irrelevant. The model's main appeal is to news sites and
advertisers: it will save news sites from their long steady decline,
and serve up a better class of customer to advertisers. There will
probably always remain sites that can serve adequate advertising
without requiring subscriptions, as well as sites that sponsor content as a
public service or to drive an agenda. But the news mainstream, I'm
convinced, will be delivered to known personas.

This will drive tremendous information sharing about each of us by
news organizations and advertisers, but Densmore says, "The cat is
already out of the bag." Abuses will have to be dampened by laws,
regulations, and industry standards.

One detail I'd like to incorporate into the new system is the ability
to walk away from a persona. That would be nearly impossible to do
now, because you'd need a credit card to sign up with a new persona,
and it could then be instantly linked to the old one. I would like
regulations requiring companies to discard what they know about you,
upon your request. In exchange, they can charge higher subscription
rates for new personas, under the presumption that advertising is less
lucrative. (However, the personalized news choices they deliver will
also be less relevant, which makes charging more money for new
personas a dilemma.)

Any major change in a marketplace calls for a look at possibilities
of market consolidation. In this case, one site could garner a
preponderance of the readership, giving it untoward control over the
news we see. I don't worry much about this because I expect news
stories to continue breaking out through social networking, and
because some Fox News or Al Jazeera will be able to emerge to break
hidebound monopolies. Densmore wards off bias through a neutral body
he proposes in the paper:
The Information Trust Association (ITA) would create protocols and
business rules that enable appropriate network collaboration and
exchange -- a level playing field. The ITA would be guided by
publishers, broadcasters, telecom and technology companies, account
managers, trade groups and the public.

Even without a monopoly, there's a danger of us becoming passive
consumers once each of us has his own front page. Densmore issues a
call, along with journalistic visionary Dan Gillmor, for readers to become
informed information consumers. This is particularly important because
we don't want to receive just updates on topics we've already
expressed interest in. We want to know important new topics.

The history of privacy breaches has also shown that, in the absence of
strong EU-style privacy laws, agreements by publishers to respect
reader privacy cannot be trusted. There will be loopholes and too many
temptations to bend the rules. The current Supreme Court believes that
advertisers have a nearly inviolable right to send people ads, as their
ruling on the recent Vermont drug marketing law shows.

I'll end by saying that the role of citizen journalists hasn't been
clearly integrated into the proposed regime. A hefty section of the
paper is devoted to them, but it's not clear to me whether blogs are
just another set of feeds into the personalized channel, or whether
user recommendations will be factored into the choices offered by news
sites. This is an interesting topic to watch as RJI seeks partners
and tries to implement the system.

As Densmore told me on the phone, numerous social barriers to his
proposal have to be overcome. News sites have to be persuaded to try
his form of deep-delving aggregation. Readers have to be persuaded
that the news is once again worth paying for and that they can benefit
by sharing their preferences with news providers. It will be quite
some time before we can determine where the proposal is going.


May 06 2011

Search Notes: The high cost of search market share

Here's what caught my attention in the search world this week.

Bing's partnership with RIM: Will distribution lead to increased mobile search share?

Search market share isn't just about providing great search results. It's also about distribution. Become the default search provider in an application or on a device, and as a search engine you've at least partially won the battle for those users (unless your search experience is so bad that it overcomes their normal inertia about changing defaults and drives them to a competitor).

Google currently has 97% mobile market share in the United States, which is partially due to distribution — both with its Android OS and as the default search on the iPhone. (And consumers are increasingly interested in Android and iPhone over RIM and Microsoft Windows mobile.)

But Bing is trying to change the market share balance, in part by becoming the default search provider on RIM BlackBerry devices. Microsoft smartphones make up 9% of the smartphone market (vs. more than 50% for the combination of Android and iPhone). RIM makes up an additional 33%.

Some think that Microsoft's aggressive pursuit of distribution deals makes poor business sense:

Microsoft's Bing search engine is indeed gaining some share of search queries in the US market (globally, Bing is nowhere). But it is gaining this share at an absolutely mind-boggling cost. Specifically, Microsoft is gaining share for Bing by doing spectacularly expensive distribution deals, deals that don't even come close to paying for themselves in additional revenue.

How much is Microsoft spending to buy market share for Bing?

Based on an analysis of Microsoft's financial statements, Bing is paying about 3X as much for every incremental search query as it generates in revenue from that query.
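In concrete terms, the claim works out like this (the cent amounts below are made-up round placeholders, not figures from Microsoft's financial statements):

```python
# Illustrative arithmetic only: the per-query amounts are placeholders.
revenue_cents_per_query = 1
cost_cents_per_query = 3 * revenue_cents_per_query  # the "about 3X" claim
incremental_queries = 1_000_000

# Every incremental query loses the difference between cost and revenue.
loss_dollars = incremental_queries * (
    cost_cents_per_query - revenue_cents_per_query) / 100
print(loss_dollars)  # → 20000.0
```

Paying 3X what a query earns means each point of market share bought this way deepens the loss rather than amortizing it.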

Continued personalization of Google News

Radar's Alex Howard, writing recently about research around how we increasingly look online for political news, noted:

Polarization can express itself in how people group online and offline. As with so many activities online, political information gathering online requires news consumers to be more digitally literate. That may mean recognizing the potential for digital echo chambers, where unaware citizens become trapped in a filter bubble created by rapidly increasing personalization in search, commercial and social utilities like Google, Amazon and Facebook.

The research, conducted by the Pew Internet and Life Project, found that actually, we are exposed to a variety of viewpoints online. But those who are concerned about potential filter bubbles may be wary of new personalization features of Google News that use previous Google News activity to shape the "News for you" and a new "Recommended Sections" feature. Google says personalization uses both "subjects and sources," so it will expose content based on topics you're interested in (which may come from a variety of sources and viewpoints) and sources you've clicked on (which may be more likely to share your perspective).

Search and Osama Bin Laden

News events always cause search spikes, but the death of Osama Bin Laden caused an all-out search frenzy. Yahoo reported a 98,550% increase in searches for the name on May 1, in part driven by teenagers wondering who he was.

Google Trends result for May 2 2011
Google Trends result for May 2, 2011.

Over on Search Engine Land, Danny Sullivan compared Google results on September 11, 2001, when Google posted a message on their home page advising searchers looking for new information to go elsewhere, vs. May 1, 2011, when a combination of news articles and tweets provided up-to-the-minute news in search results. (Google's inability to provide real-time news coverage on September 11, 2001 led to the creation of Google News.)


March 10 2011

Search Notes: The future is mobile. And self-driving cars

In the search world, the last week has been all about mobile.

Foursquare 3.0

At SMX West on Tuesday, Foursquare's Tristan Walker gave a keynote where he talked about expanding Foursquare as a customer loyalty and acquisition platform for business. To that end, they've launched new social and engagement features (just in time for SXSW!).

How is this related to search? Here's the key sentence from Foursquare's 3.0 announcement:

For years we've wanted to build a recommendation engine for the real world by turning all the check-ins and tips we've seen from you, your friends, and the larger foursquare community into personalized recommendations.

Foursquare's new "explore" tab lets you search for anything you want (from "coffee" to "80s music") and provides results based on all the information Foursquare has at its disposal, including places your friends have visited and the time of day.

Google is trying to get in this space with Latitude and Hotpot. After all, how can Google possibly hope to offer the same quality search results for "wifi coffee" without data about what kinds of coffee houses you and your friends frequent most often? This is personalization based on overall behavior, not just online behavior, and it's both fascinating and creepy to think about the logical next steps.
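One way such a recommender might weigh those signals (friends' check-ins, time of day) can be sketched like this. The scoring function, venue data, and weights are purely my illustration; Foursquare's actual ranking is not public:

```python
from datetime import datetime

def score_venue(venue, friend_checkins, now=None):
    """Hypothetical scoring sketch: weight a venue by how often
    friends checked in, discounted when the current hour falls
    outside its usual busy hours."""
    now = now or datetime.now()
    friend_weight = friend_checkins.get(venue["id"], 0)
    hour_match = 1.0 if now.hour in venue["busy_hours"] else 0.3
    return friend_weight * hour_match

venues = [
    {"id": "cafe1", "name": "Wifi Cafe", "busy_hours": range(7, 11)},
    {"id": "bar1", "name": "80s Bar", "busy_hours": range(20, 24)},
]
friends = {"cafe1": 5, "bar1": 2}  # friends' check-in counts

morning = datetime(2011, 3, 10, 9, 0)
ranked = sorted(venues, key=lambda v: score_venue(v, friends, morning),
                reverse=True)
print(ranked[0]["name"])  # → Wifi Cafe
```

Run the same query at 10 p.m. and the bar outranks the cafe: the "same" search returns different results for different people at different times, which is the fascinating and creepy part.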

Unfortunately for Google, they missed a huge opportunity to get in on this space early when they acquired Dodgeball and effectively killed it, causing the founders to leave Google and start Foursquare.

Bing is also investing in mobile/local search, the latest being "local deals" on iPhone and Android (although not yet on Windows mobile).

Where 2.0: 2011, being held April 19-21 in Santa Clara, Calif., will explore the intersection of location technologies and trends in software development, business strategies, and marketing.

Save 25% on registration with the code WHR11RAD

Continued growth in mobile

According to discussion from a recent local online advertising conference, mobile advertising could become the dominant form of online advertising by 2015. About 5% of paid search is currently mobile, and that number could double by year's end. Google has about 98% mobile search share in the United States and 97% of mobile search spend.

Google says mobile search accounts for 15% of their total searches, distributed as follows:

  • 30% - restaurants
  • 17% - autos
  • 16% - consumer electronics
  • 15% - finance and insurance
  • 15% - beauty and personal

Continued discussion of Google's "content farm" update

As discussed last week, Google's algorithm change impacted 12% of queries and the talk about it has not died down. I wrote a diagnostic guide about analyzing data and creating an action plan and Google opened a thread in their discussion forum to get feedback from site owners.

Self-driving cars!

OK, maybe this isn't really search, except that it's coming from Google, but it's self-driving cars! We live in the future!

Search Engine Land's Danny Sullivan took some video at TED of the cars in action, including some footage inside an actual self-driving car.

Surely flying cars are next.

Got news?

News tips are always welcome, so please send them along.


April 25 2010

Reconstructing Users' Web Histories From Personalized Search Results

An anonymous reader sends along this excerpt from MIT's Technology Review: "Personalization is a key part of Internet search, providing more relevant results and gaining loyal customers in the process. But new research highlights the privacy risks that this kind of personalization can bring. A team of European researchers, working with a researcher from the University of California, Irvine, found that they were able to hijack Google's personalized search suggestions to reconstruct users' Web search histories (PDF). Google has plugged most of the holes identified in the research, but the researchers say that other personalized services are likely to have similar vulnerabilities."

Read more of this story at Slashdot.
