
January 09 2012

The hidden language and "wonderful experience" of product reviews

How do reviews, both positive and negative, influence the price of a product on Amazon? What phrases used by reviewers make us more or less likely to complete a purchase? These are some of the questions that computer scientist Panagiotis Ipeirotis, an associate professor at New York University's Stern School of Business, set out to investigate by analyzing the text in thousands of reviews on Amazon. Ipeirotis continues to research this space.

Ipeirotis' findings are surprising: consumers will pay more for the same product if the seller's reviews are good, certain types of negative reviews actually boost sales, and spelling plays an important role.

Our interview follows.

How important are product reviews on Amazon? Can they give sellers more pricing power?

Panagiotis Ipeirotis: The reviews have a significant effect. When buying online, customers are not only purchasing the product, they're also inherently buying the guarantee of a seamless transaction. Customers read the feedback left by other buyers to evaluate the reputation of the seller. Since customers are willing to pay more to buy from merchants with a better reputation — something we call the "reputation premium" — that feedback tends to have an effect on future prices that the merchant can charge.

What are some of the most influential phrases?

Panagiotis Ipeirotis: "Never received" is a killer phrase in terms of reputation. It reduced the price a seller can charge by an average of $7.46 in the products examined. "Wonderful experience" is one of the most positive, increasing the price a seller can charge by $5.86 for the researched products.
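To make the mechanics concrete, here is a minimal sketch of phrase-based premium scoring. The two phrases and their dollar effects come from the interview; everything else (the function, the sample feedback) is illustrative and is not Ipeirotis' actual model, which estimated these effects econometrically across thousands of transactions.

```python
# Illustrative phrase effects taken from the interview above; a real model
# would estimate these coefficients from thousands of transactions.
PHRASE_EFFECTS = {
    "never received": -7.46,
    "wonderful experience": +5.86,
}

def reputation_premium(feedback_texts):
    """Sum the estimated price effects of known phrases across all feedback."""
    premium = 0.0
    for text in feedback_texts:
        lowered = text.lower()
        for phrase, effect in PHRASE_EFFECTS.items():
            if phrase in lowered:
                premium += effect
    return premium

feedback = [
    "Wonderful experience, fast shipping!",
    "Item never received after three weeks.",
]
print(round(reputation_premium(feedback), 2))  # -1.6
```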

How can very positive reviews be bad for sales?

Panagiotis Ipeirotis: Extremely positive reviews that contain no concrete details tend to be perceived as non-objective — written by fanboys or spammers. We observed this mainly in the context of product reviews, where superlative phrases like "Best camera!" with no further details are actually seen negatively.

Can a negative review ever be good for sales?

Panagiotis Ipeirotis: It can when the review is overly negative or criticizes aspects of the product that are not its primary purpose — the video quality in an SLR camera, for example. Or, when customers have unreasonable expectations: "Battery life lasts only for two days of shooting." Readers interpret these types of negative comments as "This is good enough for me," and it decreases their uncertainty about the product.

What is the effect of badly written reviews on sales?

Panagiotis Ipeirotis: Reviews containing spelling and grammatical errors consistently result in suboptimal outcomes, like lower sales or lower response rates. That was a fascinating but, in retrospect, expected finding. This holds true in a wide variety of settings, from reviews of electronics to hotels. It's even the case when examining email correspondence about a decision, such as whether or not to hire a contractor.

We don't know the exact reason yet, but the effect is very systematic. There are several possible explanations:

  • Readers think that the customers who buy this product are uneducated, so they don't buy it.
  • Reviews that are badly written are considered unreliable and therefore increase the uncertainty about the product.
  • Badly written reviews are unsuccessful attempts to spam and are a signal that even the other good reviews may not be authentic.
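One way to operationalize the writing-quality signal is to treat the share of unrecognized words as a feature. This is a hedged sketch: the tiny vocabulary below is a stand-in for a real dictionary or spell-checker, and the threshold at which sales suffer is not something the research specifies.

```python
import re

# A tiny stand-in vocabulary; a real system would use a full dictionary
# or spell-checker rather than this hand-picked word list.
VOCAB = {"this", "camera", "takes", "great", "pictures", "the",
         "battery", "life", "is", "short", "i", "love", "it"}

def misspelling_rate(review):
    """Fraction of tokens not found in the known-word list."""
    tokens = re.findall(r"[a-z']+", review.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in VOCAB)
    return unknown / len(tokens)

print(misspelling_rate("This camera takes great pictures. I love it."))  # 0.0
print(misspelling_rate("This camra takes graet picturs."))               # 0.6
```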

What's the relationship between the product attributes discussed in reviews and the attributes that lead to sales?

Panagiotis Ipeirotis: We observed that the aspects of a product that drive the online discussion are not necessarily the ones that define consumer decisions to buy it. For example, "zoom" tends to be discussed a lot for small point-and-shoot cameras. However, very few people are influenced by the zoom capabilities when it comes down to deciding which camera to buy.

This interview was edited and condensed.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


November 01 2011

Demoting Halder: A wild look at social tracking and sentiment analysis

I've been holding conversations with friends around a short story I put up on the Web last week, Demoting Halder, and interesting reactions have come up. Originally, the story was supposed to lay out an alternative reality where social tracking and sentiment analysis had taken over society so pervasively that everything people did revolved around them. As the story evolved, I started to wonder whether the reality in the story was an alternative one or something we are living right now. I think this is why people have been responding to the story.

The old saying, "First impressions are important" is going out of date. True, someone may form a lasting opinion of you based on the first information he or she hears, but you no longer have control over what this first information is. Businesses go to great lengths to influence what tops the results in Google and other search engines. There are court battles over ads that are delivered when people search for product names--it's still unclear whether a company can be successfully sued for buying an ad for a name trademarked by a competitor. But after all this effort, someone may hear about you first on some forum you don't even know about.

In short, by the time people call you or send email, you have no idea what they know already and what they think about you. One friend told me, "Social networking turns the whole world into one big high school (and I didn't like high school)." Nearly two years ago, I covered questions of identity online, with a look at the effects of social networking, in a series on Radar. I think it's still relevant, particularly concerning the choices it raised about how to behave on social networks, what to share, and--perhaps most importantly--how much to trust what you see about other people.

Some people assiduously monitor what comes up when they Google their name or how many followers they have on various social networks. Businesses are springing up that promise even more sophisticated ways to rank people or organizations. Some of the background checking shades over into outright stalking, where an enemy digs up obscure facts that seem damaging and posts them to a forum where they can influence people's opinion of the victim. One person who volunteered for a town commission got on the wrong side of somebody who came before the commission, and had to cope with such retaliation as having pictures of her house posted online along with nasty comments. I won't mention what she found out when she turned the tables and looked the attacker up online. After hearing her real-life experiences, I felt like my invented story would soon be treated as a documentary.

And the success characters have in gaming the system in Demoting Halder should be readily believable. Today we depend heavily on ratings even though there are scads of scams on auction sites, people using link farms and sophisticated spam-like schemes to boost search results, and skewed ratings on travel sites and similar commercial ventures.

One friend reports, "It is amazing how many people have checked me and my company out before getting on the initial call." Tellingly, she goes on to admit, "Of course, I do the same. It used to be that was rare behavior. Now it is expected that you will have this strange conversation where both parties know way too much about each other." I'm interested in hearing more reactions to the story.


October 14 2011

Visualization of the Week: Sentiment in the Bible

New textual analysis tools are providing interesting insights into classic works of literature. Last month, for example, we looked at a visualization based on character frequency in Jane Austen novels.

Along similar lines, OpenBible.info has just released a visualization showing a sentiment analysis of the Bible.

A blog post announcing the visualization outlines the ebbs and flows that were uncovered:

Things start off well with creation, turn negative with Job and the patriarchs, improve again with Moses, dip with the period of the judges, recover with David, and have a mixed record (especially negative when Samaria is around) during the monarchy. The exilic period isn't as negative as you might expect, nor the return period as positive. In the New Testament, things start off fine with Jesus, then quickly turn negative as opposition to his message grows. The story of the early church, especially in the epistles, is largely positive.

Screenshot from OpenBible.info's Bible sentiment visualization
This Bible visualization from OpenBible.info includes both the Old and New Testaments. Black indicates a positive sentiment, red negative.

OpenBible.info created the visualization by running the Viralheat Sentiment API across a number of translations. The raw data from OpenBible's visualization is available for download.

A second visualization breaks down the sentiment by specific book, making it easier to see those that contain overwhelmingly positive sentiment (Psalms, for example), those that contain negative sentiment (Job), and those that go from bad to worse (Jonah).

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR


May 04 2011

Trading on sentiment

Computers don't get emotionally invested in financial trades, but they do take feelings seriously.

Case in point: The financial trading dashboard managed by Thomson Reuters uses sentiment analysis data from Lexalytics to track news on 20,000 stocks and thousands of commodities. The Lexalytics system parses text from multiple sources, looking for keywords, tone, relevance and freshness. The resulting textual analysis (the meaning of the text) and sentiment analysis (the emotions in the text) is then incorporated into widely used algorithmic trading systems.

Mark Thompson, CEO of McKinley Software (the parent company of Lexalytics), told me more about this emotion-to-data conversion. "Our financial engine is something we developed over an 8-year period, and the main partner for that is Thomson Reuters," Thompson said. "The Thomson Reuters news passes through our black box and we kick out scores based on 80 different variables for all of the articles."

Algorithmic trading is automated trading where trading software takes various inputs, or "trading signals," and uses them to decide what trades to make. Trades are executed in a matter of milliseconds and there is no human intervention. In 2009, algorithmic trading accounted for more than 25 percent of all shares traded on the buy side. No human being can read the latest financial news fast enough to contribute to those buy or sell decisions. That's where sentiment analysis comes in.
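As a toy illustration of how sentiment scores become trading inputs, here is a threshold-based sketch. The tickers, scores, and thresholds are invented; production systems like the Thomson Reuters/Lexalytics feed emit dozens of variables per article and plug into far more sophisticated models.

```python
def trade_signal(sentiment_scores, buy_threshold=0.3, sell_threshold=-0.3):
    """Average recent sentiment for one ticker into a buy/sell/hold decision."""
    avg = sum(sentiment_scores) / len(sentiment_scores)
    if avg >= buy_threshold:
        return "BUY"
    if avg <= sell_threshold:
        return "SELL"
    return "HOLD"

# Invented per-article scores for three hypothetical tickers.
scores_by_ticker = {
    "AAPL": [0.8, 0.5, 0.6],    # strongly positive coverage
    "XYZ":  [-0.5, -0.6, -0.4], # negative coverage
    "ABC":  [0.1, -0.1, 0.0],   # mixed, near-neutral coverage
}
for ticker, scores in scores_by_ticker.items():
    print(ticker, trade_signal(scores))
```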

"By scoring the news, within milliseconds we get a very accurate view of what's being said about a particular stock or sector," Thompson said. "Thomson Reuters sells that output to trading houses who then plug this data into their algorithmic trading models. We have found that we can predict stock market movements. We provide an extra layer of richness that trading staff haven't been able to get their hands on. Otherwise, you are just doing very two-dimensional quant processing."

Rochester Cahan, VP of Global Equity Quantitative Strategy at Deutsche Bank, has been experimenting with the Thomson Reuters system. Cahan told me that he has seen significant improvements in trading performance when the text analysis and sentiment scores are used as trading inputs. In addition, the scores are uncorrelated with existing trading signals — in other words, they provide new information to the trading system.

The most positive sentiment levels (e.g. Apple releases the iPad to universal acclaim) are not necessarily the most useful for trading. The stock price reacts very quickly so it's difficult to take advantage of the information. However, Cahan said stocks with moderate positivity tend to be overlooked by the market and can make for good buys.

I asked Thompson about the limitations of the sentiment analysis technology. He explained that even human beings don't agree on the sentiment of an article more than about 85% of the time. "The problem with our kind of engine is trying to get above 85% accuracy," Thompson said. "Beyond that level, you get a diminishing return and you need more human intervention. This leaves the human analyst to pass different types of judgements."

The competitive edge may be lost if all trading systems use sentiment analysis, but Thompson thinks there is some distance to go before we get to that point. "Everyone has a slightly different way of composing the model and using the news, and there are always advances in the technology," he said. "But there will come a point when sentiment becomes an ordinary part of the trading mix."

Photo: ABOVE by Lyfetime, on Flickr


May 03 2011

Four short links: 3 May 2011

  1. SentiWordNet -- WordNet with hints as to sentiment of particular terms, for use in sentiment analysis. (via Matt Biddulph)
  2. Word Frequency Lists and Dictionaries -- also for text analysis. This site contains what we believe is the most accurate frequency data of English. It contains word frequency lists of the top 60,000 words (lemmas) in English, collocates lists (looking at nearby words to see word meaning and use), and n-grams (the frequency of all two and three-word sequences in the corpora).
  3. Crash Course in Web Design for Startups -- When I was a wee pixel pusher I would overuse whatever graphic effect I had just learned. Text-shadow? Awesome, let's put 5px 5px 5px #444. Border-radius? Knock that up to 15px. Gradients? How about from red to black? You can imagine how horrible everything looked. Now my rule of thumb in most cases is applying just enough to make it perceivable, no more. This usually means no blur on text-shadow and just a 1px offset, or only dealing with gradients moving between a very narrow color range. Almost everything in life is improved with this rule.
  4. Leafsnap -- Columbia University, the University of Maryland and the Smithsonian Institution have pooled their expertise to create the world’s first plant identification mobile app using visual search—Leafsnap. This electronic field guide allows users to identify tree species simply by taking a photograph of the tree’s leaves. In addition to the species name, Leafsnap provides high-resolution photographs and information about the tree’s flowers, fruit, seeds and bark—giving the user a comprehensive understanding of the species. iPhone for now, Android and iPad to come. (via Fiona Romeo)

March 31 2011

With sentiment analysis, context always matters

People are finding new ways to use sentiment analysis tools to conduct business and measure market opinion. But is such analysis really effective, or is it too subjective to be relied upon?

In the following interview, Matthew Russell (@ptwobrussell), O'Reilly author and principal and co-founder of Zaffra, says the quality of sentiment analysis depends on the methodology. Large datasets, transparent methods, and remembering that context matters, he says, are key factors.

What is sentiment analysis?

Matthew Russell: Think of sentiment analysis as "opinion mining," where the objective is to classify an opinion according to a polar spectrum. The extremes on the spectrum usually correspond to positive or negative feelings about something, such as a product, brand, or person. For example, instead of taking a poll, which essentially asks a sample of a population to respond to a question by choosing a discrete option to communicate sentiment, you might write a program that mines relevant tweets or Facebook comments with the objective of scoring them according to the same criteria to try and arrive at the same result.
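In its simplest lexicon-based form, the polar-spectrum scoring Russell describes might look like the following sketch. The word lists are illustrative, not drawn from any real lexicon, and real systems use much larger resources or trained classifiers.

```python
# Illustrative word lists; real systems use large lexicons (e.g. SentiWordNet)
# or trained classifiers instead of a handful of hand-picked words.
POSITIVE = {"love", "great", "wonderful", "excellent", "good"}
NEGATIVE = {"hate", "terrible", "awful", "bad", "broken"}

def polarity(text):
    """Score text on a [-1, 1] spectrum from pure negative to pure positive."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(polarity("I love this great product"))    # 1.0
print(polarity("terrible, broken on arrival"))  # -1.0
print(polarity("it arrived on tuesday"))        # 0.0
```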

What are the flaws with sentiment analysis? How can something like sarcasm be addressed?

Matthew Russell: Like all opinions, sentiment is inherently subjective from person to person, and can even be outright irrational. It's critical to mine a large — and relevant — sample of data when attempting to measure sentiment. No particular data point is necessarily relevant. It's the aggregate that matters.

An individual's sentiment toward a brand or product may be influenced by one or more indirect causes — someone might have a bad day and tweet a negative remark about something they otherwise had a pretty neutral opinion about. With a large enough sample, outliers are diluted in the aggregate. Also, since sentiment very likely changes over time according to a person's mood, world events, and so forth, it's usually important to look at data from the standpoint of time.

As to sarcasm, like any other type of natural language processing (NLP) analysis, context matters. Analyzing natural language data is, in my opinion, the problem of the next 2-3 decades. It's an incredibly difficult issue, and sarcasm and other types of ironic language are inherently problematic for machines to detect when looked at in isolation. It's imperative to have a sufficiently sophisticated and rigorous approach that relevant context can be taken into account. For example, that would require knowing that a particular user is generally sarcastic, ironic, or hyperbolic, or having a larger sample of the natural language data that provides clues to determine whether or not a phrase is ironic.
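Russell's point about aggregation, that no single data point matters and outliers wash out over time, can be sketched as a simple time-bucketed average. The dates and scores below are invented; with only three posts the outlier still drags the mean down, which is exactly why a large sample matters.

```python
from collections import defaultdict
from datetime import date

# Invented per-message scores. The single "bad day" outlier on March 1
# still shows with only 3 posts; a large sample would dilute it.
scored_posts = [
    (date(2011, 3, 1), 0.6), (date(2011, 3, 1), 0.4),
    (date(2011, 3, 1), -0.9),
    (date(2011, 3, 2), 0.5), (date(2011, 3, 2), 0.7),
]

def daily_sentiment(posts):
    """Bucket scores by day and average each bucket."""
    buckets = defaultdict(list)
    for day, score in posts:
        buckets[day].append(score)
    return {day: sum(s) / len(s) for day, s in sorted(buckets.items())}

for day, avg in daily_sentiment(scored_posts).items():
    print(day, round(avg, 3))
```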

Is the phrase "sentiment analysis" being used appropriately?

Matthew Russell: I've never had a problem with the phrase "sentiment analysis" except that it's a little bit imprecise in that it says nothing about how the analysis is being conducted. It only describes what is being analyzed — sentiment. Given the various flaws I've described, it's pretty clear that the analysis techniques can sometimes be as subjective as the sentiment itself. Transparency in how the analysis occurs and additional background data — such as the context of when data samples were gathered, what we know about the population that generated them, and so forth — is important. Of course, this is the case for any test involving non-trivial statistics.

Sentiment analysis recently was in the news, touted as an effective tool for predicting stock market prices. What other non-marketing applications might make use of this sort of analysis?

Matthew Russell: Stock market prices are potentially a problematic example because it's not always the case that a company that creates happy consumers is necessarily profitable. For example, key decision makers could still make poor fiscal decisions or take on bad debt. Like anything else involving sentiment, you have to hold the analysis loosely.

A couple examples, though, might include:

  • Politicians could examine the sentiment of their constituencies over time to try and gain insight into whether or not they are really representing the interests that they should be. This could possibly involve realtime analysis for a controversial topic, or historical analysis to try and identify trends such as why a "red state" is becoming a "blue state," or vice-versa. (Sentiment analysis is often looked at as a realtime activity, but mining historical samples can be incredibly relevant too.)
  • Conference organizers could use sentiment analysis based on book sales or related types of data to identify topics of interest for the schedule or keynotes.

Of course, keep in mind that just because the collective sentiment of a population might represent what the population wants, it's not necessarily the case that it's in its best interests.


March 10 2011

Who is the champion of SXSW?

We have reviewed every SXSW Twitter post from 2009, 2010 and 2011 to identify the show's biggest influencers.

This year, as in past years, Chris Brogan (@chrisbrogan) is the champion of champions for SXSW. For three years running, he's been the top influencer on PeopleBrowsr's SXSW Interest Graph. The king of connectedness has the most friends on Twitter discussing SXSW — a reigning title that resonates with his social media identity.

SXSW influential individuals year after year

SXSW influential companies year after year

Why Chris Brogan?

Chris has the highest number of followers who are interested in SXSW. His followers are having conversations about SXSW and often tweet SXSW mentions and news. Chris is an influencer for SXSW because he has a high number of engaged connections who are interested in this topic. He is a brand champion for SXSW because of his potential influence in the SXSW interest-based community.

We identify champions as people who have the most followers tweeting a topic of interest. The same analysis can be done for champions within locations or communities. Community champions are those people who have the greatest number of friends within a particular community talking about a particular topic.

We analyzed the list of all SXSW mentions to find the central influence connectors. Our goal was to discover how influencers discussing SXSW are connected to each other and which influencers are the most interconnected among the group. We checked every connection, frequency of conversation and engagement, and compared each person to everyone else in the list. This process was repeated for individuals in the global SXSW conversation, as well as the top communities to create a connections graph based on interest.

Chris is connected to the highest number of people in the SXSW champion community who are also discussing SXSW topics. His messages reach the highest number of people who are interested in SXSW.
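The champion metric described above, counting how many of a candidate's followers are tweeting about the topic, can be sketched as a simple set intersection. The follower sets and user ids here are tiny invented stand-ins for PeopleBrowsr's real Twitter data.

```python
# Hypothetical follower graph: candidate -> set of follower ids, plus the
# set of users who have tweeted about the topic of interest.
followers = {
    "chrisbrogan": {"a", "b", "c", "d"},
    "other_user": {"c", "e"},
}
tweeted_about_sxsw = {"a", "b", "c", "e"}

def champion(followers_by_user, topic_tweeters):
    """Return the candidate with the most followers tweeting the topic."""
    return max(followers_by_user,
               key=lambda user: len(followers_by_user[user] & topic_tweeters))

print(champion(followers, tweeted_about_sxsw))  # chrisbrogan
```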

What does he say?

We were interested in the content of Chris' messages and did human sentiment analysis to gather further insights into his influence. Chris' tweets mainly focus on awareness and capturing attention, reviewing emerging tech and startups, and big picture ideas. Chris is a positive tweeter — even his negative comments have a nice tone.


Most of his interaction on Twitter is with other tech influencers, social media experts and marketers who also have high follower counts and close connections. Chris is a highly influential trust agent in social media. He's a prolific tweeter, personal, approachable and actively engaged in conversations.

And he is not attending SXSW this year ...

Web 2.0 Expo San Francisco 2011, being held March 28-31, will examine key pieces of the digital economy and the ways you can use important ideas for your own success.

Save 20% on registration with the code WEBSF11RAD

Other champions

Perhaps next year we'll have a new reigning head of the Twitterverse. Here are a few other top champions we analyzed, including Liz Strauss, Robert Scoble, and Kevin Rose. Through human sentiment analysis we found no surprises — the trait these champions have in common is that they retweet, share messages, respond in real time and provide useful information on topics that interest their followers.

As a champion, Liz Strauss uses Twitter to both broadcast and engage in conversation. She often retweets others and is mostly neutral — though her tone is authoritative and her style is honest. She has a lot of mentions about public speaking and she posts recommendations to help others improve in this area of expertise. Her tweets about SXSW focus on finding ways to maximize her conference time — and she has frequent conversations with other Twitter champions.

Robert Scoble is another veteran of SXSW, and it was no surprise that he'd be a top champion for the festivities. His 160,000+ followers are interested in technology news and social media. He mainly uses Twitter as a medium to engage with other geeks — he's active in @replies and takes the time to respond to people, regardless of their influence or follower count. He also seems to be sharing more than broadcasting. He has a fondness for startups and promotes and reviews new products often. Scoble has been tweeting a lot about SXSW this year, yet his relative influence ranking was at its peak in 2009.

Kevin Rose has more than 1.2 million followers. Reviewing his tweets with human sentiment analysis, we found that his positivity is off the charts. He's very conversational with the developer community and encouraging to people who are launching products/ideas. He loves to thank the community and to get involved. Though he rarely retweets, he replies to others frequently. He's also a dedicated sports fan and tweets a lot about food.

Rose will be a champion for many startups and will be at SXSW this year.

How we found these champions

We created SXSW Brand Champion Scorecards for 2011 from global mentions of SXSW, and invite you to walk the interest graph to see the connections of additional champions and the communities they influence.

The Brand Champion Scorecards and the Interest Graph are integrated with

Twitter has made it possible for people to openly make friends with others who have like-minded interests — regardless of first-degree personal connections. We follow people who are interested in the things we're interested in, and in many ways we are what we tweet.

We'd love to connect with you in Austin. Tweet me or @PriscillaScala or @Jen_Charlton and meet the team in person. We'll be tweeting throughout and following all of our SXSW champions.


December 10 2010

Strata Gems: Ushahidi enables crowdsourced journalism and intelligence

We're publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Make beautiful graphs of your Twitter network.

Ushahidi is a software platform built to crowdsource information over multiple channels such as text messages, email and Twitter. Originally built in 2008 to map reports of post-election violence in Kenya, Ushahidi has evolved into a non-profit company with a suite of tools that enables crowdsourced information aggregation, with applications ranging from citizen journalism and crisis management to the more commercial side of brand monitoring.

You can use the Ushahidi tools in two ways: by downloading the source code and running it yourself, or by taking advantage of their hosted platforms SwiftRiver and Crowdmap.

Crowdmap is a minimum-fuss way to use the Ushahidi tools to collect and visualize geographical data. Though built for emergency use, it has many applications for representation of local knowledge.

A free-to-use web-hosted service, Crowdmap has been used to help with reporting floods in Pakistan, and to aggregate reports from citizen journalists about the Pope's visit to the UK.

UK Tube Strike Map
Crowdmap of the UK Tube strikes, created by the BBC

SwiftRiver is a media aggregation and filtering tool. It aggregates sources such as Twitter, blogs, email and SMS, and provides features that help identify relationships and trends in the incoming data sets. Through semantic analysis, incoming content can be automatically categorized for review.

As well as data stream management and curation, SwiftRiver places an emphasis on adding context and history to online research, enabling the location and reputation of data sources to be taken into account.

You can use SwiftRiver as a hosted service, including a free individual plan, or download and run it yourself.

December 02 2010

Strata Gems: Use Wikipedia as training data

We'll be publishing a new Strata Gem each day all the way through to December 24. Yesterday's Gem: Try MongoDB without installing anything.

One of the most exciting analytical techniques is natural language processing and sentiment analysis. Given natural language text, can we use a computer to discover what's being said? Applications range from user interfaces through marketing and espionage.

The hard part of the problem is how do you teach a computer what words mean, and how do you figure out the context to select the right meaning for a word? The word "apple" could refer to the fruit, the computer company, or the Beatles' record label. Or a bank of the same name, the rock band, New York City, the singer Fiona Apple, the list goes on.

One answer is to use a classifier, which can differentiate between the different contexts in which a word is used in order to determine its sense. Most anti-spam filtering solutions use a classifier. Classifiers must be trained to be effective though, as anybody who has used anti-spam systems will tell you.

It's relatively easy to differentiate between spam and non-spam email, but how do you go about breaking down the English language to find training data for each word sense?

Fortunately, there's a large open data source available that has put a lot of effort into the disambiguation of terms such as "apple" - Wikipedia. Data scientists often use information from Wikipedia to aid in the identification of real world entities in their work, and its use for disambiguation has been described in several reports, including this 2007 paper from Rada Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation (PDF).

Wikipedia front page

The key concept is that in the Wikipedia article for the Apple computer company, the word "apple" is used in the context of meaning the company, so you can use it to train natural language classifiers for that sense of the word. The Wikipedia article for apple the fruit offers a similar corpus for the fruity context, and so on. The Wikipedia URL for a particular concept is an unambiguous tag that you can then use to identify word sense.

Fortunately, you don't need to be a deep researcher to start using Wikipedia in this way. A recent blog post from Jim Plush shows how to use Wikipedia and Python to disambiguate words from Twitter posts. With a relatively brief Python script and training data culled from Wikipedia, Plush was able to distinguish between apple the fruit and Apple the company in the text of Twitter posts mentioning "apple".
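A condensed sketch of the same idea: train a naive Bayes classifier on text standing in for the two Wikipedia articles, then disambiguate "apple" in new sentences. The training snippets are tiny invented stand-ins for real article text, and the classifier is hand-rolled for self-containment rather than being Plush's actual NLTK-based code.

```python
import math
from collections import Counter

# Tiny invented stand-ins for the text of the two Wikipedia articles.
training = {
    "Apple_Inc": "apple computer company iphone mac software steve jobs",
    "Apple_fruit": "apple fruit tree orchard pie juice sweet harvest",
}

counts = {label: Counter(text.split()) for label, text in training.items()}
vocab = {w for c in counts.values() for w in c}

def classify(sentence):
    """Pick the sense whose training text best explains the sentence."""
    words = sentence.lower().split()
    best_label, best_logprob = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values())
        # log P(words | label) with add-one smoothing and a uniform prior
        logprob = sum(math.log((c[w] + 1) / (total + len(vocab)))
                      for w in words)
        if logprob > best_logprob:
            best_label, best_logprob = label, logprob
    return best_label

print(classify("new apple software update for the mac"))  # Apple_Inc
print(classify("baked an apple pie from the orchard"))    # Apple_fruit
```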

For more information, check out the Python Natural Language Toolkit web site. Also, the Strata panel Online Sentiment, Machine Learning, and Prediction will dive into real world uses of sentiment analysis and machine learning.

August 31 2010

Four short links: 31 August 2010

  1. Rules for Revolutionaries -- Carl Malamud's talk to the WWW2010 Conference. Video, slides, and text available.
  2. Self-Improving Bayesian Sentiment Analysis for Twitter -- a how-I-did-it for a homegrown project to do sentiment analysis on Twitter.
  3. LUXr -- the Lean User Experience Residency program. LUXr brings user experience and design services to early stage teams in a lower cost, more efficient way than traditional project-based consulting. The latest from Adaptive Path's Janice Fraser.
  4. My Top Ten Assertions About Data Warehouses (CACM) -- Michael Stonebraker's take on the data warehouse world, and his predictions cut across a lot of our O'Reilly trends. Assertion 5: "No knobs" is the only thing that makes any sense. It is pretty clear that human operational costs dominate the cost of running a data warehouse. [...] Almost all DBMSs have 100 or more complicated tuning "knobs." This requires DBAs to be "4-star wizards" and drives up operating costs. Obviously, the only thing that makes sense is to have a program that adjusts these knobs automatically. In other words, look for "no knobs" as the only way to cut down DBA costs. (via mikeolson on Twitter)

December 22 2009

Being online: Your identity to advertisers--it's not all about you

Thy self thou gav'st, thy own worth then not knowing

(This post is the fourth in a series called "Being online: identity, anonymity, and all things in between.")

Voracious data foraging leads advertisers along two paths. One of
their aims is to differentiate you from other people. If vendors know
what condiments you put in your lunch or what material you like your
boots made from, they can pinpoint their ads and promotions more
precisely at you. That's why they love it when you volunteer that
information on your blog or social network, just as do the college
development staff we examined before.

The companies' second aim is to insert you into a group of people for
which they can design a unified marketing campaign. That is, in
addition to differentiation, they want demographics.

The first aim, differentiation, is fairly easy to understand. Imagine
you are browsing web sites about colic. An observer (and I'll discuss
in a moment how observations take place) can file away the reasonable
deduction that there is a baby in your life, and can load your browser
window with ads for diapers and formula. This is called behavioral
advertising.

Since behavioral advertising is normally a pretty smooth operator, you
may find it fun to try a little experiment that could lift the curtain
on it a bit. Hand your computer over for a few hours to a friend or
family member who differs from you a great deal in interests, age,
gender, or other traits. (Choose somebody you trust, of course.) Let
him or her browse the web and carry on his or her normal business.
When you return and resume your own regular activities, check the ads
in your browser windows, which will probably take on a slant you never
saw before. Of course, the marketers reading this article will be
annoyed that I asked you to pollute their data this way.

Experiences like this might arouse you to be conscious of every online
twitch and scratch, just as you may feel in real life in the presence
of a security guard whose suspicion you've aroused, or when on stage,
or just being a normal teenager. Online, paranoia is level-headedness.
Someone indeed is collecting everything they can about you: the amount
of time you spend on one page before moving on to the next, the links
you click on, the search terms you enter. But it's all being collected
by a computer, and no human eyes are ever likely to gaze upon it.

Your identity in the computerized eyes of the advertiser is a strange
pastiche of events from your past. As mentioned at the beginning of
the article, Google's Dashboard lets you see what Google knows about
you, and even remove items--an impressive concession for a company
that has mastered better than any other how to collect information on
casual Web users and build a business on it. Of course, you have to
establish an identity with them before you can check what they know
about your identity. This is not the last irony we'll encounter when
exploring identity.

But advertisers do more than direct targeting, and I actually find
the other path their tracking takes--demographic analysis--more
problematic. Let's return to the colicky baby example. Advertisers add
you to their collection of known (or assumed) baby caretakers and tag
your record with related information to help them understand the
general category of "baby care." Anything they know about your age,
income, and other traits helps them understand modern parenting.

As I
wrote over a decade ago,
this kind of data mining typecasts us and encourages us to head down
well-worn paths. Unlike differentiation, demographics affect you
whether or not you play the game. Even if you don't go online, the
activities of other people like you determine how companies judge you.

The latest stage in the evolution of demographic data mining is
sentiment analysis, which trawls through social networking messages to
measure the pulse of the public on some issue chosen by the
researcher. A crude application of sentiment analysis is to search for
"love" or "hate" followed by a product trademark, but the natural
language processing can become amazingly subtle. Once the data is
parsed, companies can track, for instance, the immediate reaction to a
product release, and then how that reaction changed after a review or
ad was widely disseminated. Results affect not only advertising but
product development.
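That crude love/hate search is easy to sketch. The status messages and the "Droid" trademark below are invented for illustration:

```python
import re
from collections import Counter

# Hypothetical status messages; "Droid" is the trademark being tracked
messages = [
    "I love my Droid, the screen is gorgeous",
    "ugh, I hate Droid battery life",
    "really love the Droid keyboard",
    "thinking about lunch",
]

def crude_sentiment(messages, trademark):
    """Count 'love <trademark>' vs 'hate <trademark>' mentions,
    allowing a few words in between."""
    tally = Counter()
    for msg in messages:
        for verb in ("love", "hate"):
            # the verb, up to three intervening words, then the trademark
            pattern = rf"\b{verb}\b(?:\W+\w+){{0,3}}?\W+{trademark}\b"
            if re.search(pattern, msg, re.IGNORECASE):
                tally[verb] += 1
    return tally

print(crude_sentiment(messages, "Droid"))  # Counter({'love': 2, 'hate': 1})
```

Real natural language processing goes far beyond this, of course -- handling negation ("don't love"), sarcasm, and comparative statements -- but even a keyword tally like this one gives a rough pulse over a large enough stream of messages.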

Once again, my reaction to sentiment analysis mixes respect for its
technical sophistication with worries about what it does to our
independence. If you add your voice to the Twittersphere, it may be
used by people you'll never know to draw far-reaching conclusions. On
the other hand, if you refuse to participate, your opinion will be
ignored.

Google's Dashboard tells you only what they preserve on you
personally, not the aggregated statistics they calculate that
presumably include anonymous browsing. But you can peek at those as
well, and carry on some rough sentiment analysis of your own, through
Google Trends.

Considering all this demographic analysis (behavioral, sentiment, and
other) catapults me into a bit of a 21st-century-style existential
crisis. If a marketer is able to combine facts about my age, income,
place of birth, and purchases to accurately predict that I'll want a
particular song or piece of clothing, how can I flaunt my identity as
an autonomous individual?

Perhaps we should resolve to face the brave new world stoically and
help the companies pursue their goals. Social networking sites are
developing APIs and standards that allow you to copy information
easily between them. For instance, there are sites that let you
simultaneously post the same message instantly to both Twitter and
Facebook. I think we should all step up and use these services. After
all, if your off-the-cuff Tweet about your skis from the lounge of a
ski resort goes into planning a multimillion dollar campaign, wouldn't
it be irresponsible to send the advertiser mixed messages?

My call to action sounds silly, of course, because the data gathering
and analysis will obviously not be swayed by a single Tweet. In fact,
sophisticated forms of data mining depend on the recent upsurge of new
members onto the forums where the information is collected. The volume
of status messages has to be so high that idiosyncrasies get ironed
out. And companies must also trust that the margin of error caused by
malicious competitors or other actors will be negligible.

We saw in an earlier section that your online presence is signaled by
a slim swath of information. At the low end, marketers know only your
approximate location through your IP address. At the other extreme
they can feast on the data provided by someone who not only logs into
a site--creating a persistent identity--but fills out a form with
demographic information (which the vendor hopes is truthful).

As another example of modern data-driven advertising, Facebook
delivers ads to you based on the information you enter there, such as age
and marital status. A tech journal reported that the Google Droid
phone combines contacts from many sources, but I haven't experienced
this on my Droid and I don't see technically how it could be done.

Most browsing takes place in an identity zone lying between the IP
address and the filled-out profile. We saw this zone in my earlier
example from the coffee shop. The visitor does not identify himself,
but lets the browser accept a cookie by default from each site.

Each cookie--so long as you don't take action to remove one, as I did
in my experiment--is returned to the server that left it on your
browser. If you use a different browser, the server doesn't know
you're the same person, and if a family member uses your browser to
visit the same server, it doesn't know you're different people.

Because the browser returns the cookie only to servers from the same
domain--that is, the domain that sent the cookie--your identity is
automatically segmented. Whatever one site knows about you, sites in
other domains do not. Servers can also subdivide domains, so that a
mail subdomain can use the cookie to keep track of your preferred mail
settings while a weather subdomain serves meteorological information
appropriate for your location.
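A toy cookie jar illustrates that same-domain rule. The domain names (example.com and so on) are hypothetical:

```python
# Minimal sketch of a browser cookie jar enforcing the same-domain rule.
# All domain names here are hypothetical stand-ins.

class CookieJar:
    def __init__(self):
        self.cookies = {}  # domain -> cookie value

    def set_cookie(self, domain, value):
        self.cookies[domain] = value

    def cookies_for(self, host):
        """Return cookies whose domain is the host or a parent domain."""
        return {d: v for d, v in self.cookies.items()
                if host == d or host.endswith("." + d)}

jar = CookieJar()
jar.set_cookie("example.com", "uid=12345")    # set by example.com
jar.set_cookie("othersite.net", "uid=99999")  # set by an unrelated site

# Subdomains of example.com all see the example.com cookie...
print(jar.cookies_for("mail.example.com"))     # {'example.com': 'uid=12345'}
print(jar.cookies_for("weather.example.com"))  # {'example.com': 'uid=12345'}
# ...but othersite.net never does: the identities stay segmented.
print(jar.cookies_for("othersite.net"))        # {'othersite.net': 'uid=99999'}
```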

This wall between cookies would seem to protect your browsing and
purchasing habits from being dumped into a large vat and served up to
advertisers. But for every technical measure protecting privacy, there
is another technical trick that clever companies can use to breach
privacy. In the case of cookies, the trick exploits the ability of a
web page to display content from multiple domains simultaneously. Such
flexibility in serving domains is normally used (aside from tweaks to
improve performance) to embed images from one domain in a web page
sent by another, and in particular to embed advertising images.

Now, if advertisers all contract with a single ad agency, such as
DoubleClick (the biggest of the online ad companies), all the ads from
different vendors are served under DoubleClick's domain and can
retrieve the same cookie. You don't have to click on an ad for the
cookie to be returned. Furthermore, each ad knows the page on which it
was displayed.

Therefore, if you visit web pages about colic, skis, and Internet
privacy at various times, and if DoubleClick shows an ad on each page,
it can tell that the same person viewed those disparate topics and use
that information to choose ads for future pages you visit. In the
United States, unlike other countries, no laws prohibit DoubleClick
from sharing that information with anyone it wants. Furthermore, each
advertiser knows whether you click on their ad and what activity you
carry on subsequently at their site, including any purchases you make
and any personal information you fill out in a form.
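A toy model shows how one ad server's cookie links visits across unrelated sites. The site names below are hypothetical, echoing the colic/ski/privacy example:

```python
# Sketch of how a single third-party ad server can link page views
# across unrelated sites. All names are hypothetical.

from collections import defaultdict

class AdServer:
    def __init__(self):
        self.profiles = defaultdict(list)  # cookie id -> pages seen
        self.next_id = 1

    def serve_ad(self, cookie, referring_page):
        """Called each time a page embeds one of this server's ad images.
        Returns the cookie the browser stores for the ad server's domain."""
        if cookie is None:
            cookie = f"uid-{self.next_id}"
            self.next_id += 1
        # The ad request reveals which page embedded the ad.
        self.profiles[cookie].append(referring_page)
        return cookie

ads = AdServer()
cookie = None
# The same browser visits three unrelated sites, each carrying the same ad.
for page in ("colic-help.example/remedies",
             "ski-shop.example/boots",
             "privacy-news.example/cookies"):
    cookie = ads.serve_ad(cookie, page)

print(ads.profiles[cookie])
# ['colic-help.example/remedies', 'ski-shop.example/boots',
#  'privacy-news.example/cookies']
```

No site shared anything with the others; the linkage happens entirely on the ad server's side, because the browser returns the same cookie from every page that embeds the ad.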

Put it all together, and you are probably far from anonymous on the
Internet. In addition, a more recent form of persistent data,
controlled by the popular Flash environment through a technology
called local shared objects, makes promiscuous sharing easy and
removing the information much harder.

The purchase of DoubleClick in 2007 by Google, which already had more
information on individuals than anybody else, spurred a great protest
from the privacy community, and the FTC took a hard look before
approving the merger. A similar controversy may surround Google's
recently announced purchase of AdMob,
which provides a service similar to DoubleClick for advertisers on
mobile phones.

So far I've just covered everyday corporate treatment of web browsing
and e-commerce. The frontiers of data mining extend far into
the rich veins of user content.

Deep packet inspection allows your Internet provider to snoop on your
traffic. Normally, the ISP is supposed to look only at the IP address
on each packet, but some ISPs check inside the packet's content for
various reasons that could redound to your benefit (if it squelches a
computer virus) or detriment (if it truncates a file-sharing session).
I haven't heard of any ISPs using this kind of inspection for
marketing, but many predictions have been aired that we'll cross that
line.

Governments have been snooping at the hubs that route Internet traffic
for years. China simply blocks references to domains, IP addresses, or
topics it finds dangerous, and monitors individuals for other
suspected behavior. The Bush administration and American telephone
companies got into hot water for collecting large gobs of traffic
without a court order. But for years before that, the Echelon project
was filtering all international traffic that entered or left the US
and several of its allies.

One alternative to being tossed on the waves of marketing is to join
the experiments in Vendor Relationship Management (VRM), which I
covered in a recent blog post.
Although not really implemented anywhere yet, this movement holds out
the promise that we can put out bids for what we want and get back
proposals for products and services. Maybe VRM will make us devote
more conscious thinking to how we present ourselves online--and how
many selves we want to present. These are the subjects of the next section.

The posts in "Being online: identity, anonymity, and all things in between" are:

  1. Introduction

  2. Being online: Your identity in real life--what people know

  3. Your identity online: getting down to basics

  4. Your identity to advertisers: it's not all about you (this post)

  5. What you say about yourself, or selves (to be posted December 24)

  6. Forged identities and non-identities (to be posted December 26)

  7. Group identities and social network identities (to be posted December 28)

  8. Conclusion: identity narratives (to be posted December 30)
