April 14 2012

MySQL in 2012: Report from Percona Live

The big annual MySQL conference, started by MySQL AB in 2003 and run
by my company O'Reilly for several years, lives on under the able
management of Percona. This
fast-growing company started out doing consulting on MySQL,
particularly in the area of performance, and branched out into
development and many other activities. The principals of this company
wrote the most recent two editions of the popular O'Reilly book High
Performance MySQL (http://shop.oreilly.com/product/0636920022343.do).

Percona started offering conferences a couple years ago and decided to
step in when O'Reilly decided not to run the annual MySQL conference
any more. Oracle did not participate in Percona Live, but has
announced its own MySQL conference
(http://www.oracle.com/us/corporate/press/1577449) for next September.

Percona Live struck me as a success, with about one thousand attendees
and the participation of leading experts from all over the MySQL
world, save for Oracle itself. The big players in the MySQL user
community came out in force: Facebook, HP, Google, Pinterest (the
current darling of the financial crowd), and so on.

The conference followed the pattern laid down by old ones in just
about every way, with the same venue (the Santa Clara Convention
Center, which is near a light-rail but nothing else of interest), the
same food (scrumptious), the standard format of one day of tutorials
and two days of sessions (although with an extra developer day tacked
on, which I will describe later), an expo hall (smaller than before,
but with key participants in the ecosystem), and even community awards
(O'Reilly Media won an award as Corporate Contributor of the Year).
Monty Widenius was back as always with a MariaDB entourage, so it
seemed like old times. The keynotes seemed less well attended than the
ones from previous conferences, but the crowd was persistent and
showed up in impressive numbers for the final events--and I don't
believe it was because everybody thought they might win one of the
door prizes.

Jeremy Zawodny ready to hand out awards.

Two contrasting database deployments

I checked out two well-attended talks by system architects from two
high-traffic sites: Pinterest and craigslist. The radically divergent
paths they took illustrate the range of options open to data centers
nowadays--and the importance of studying these options so a data
center can choose the path appropriate to its mission and
applications.

Jeremy Zawodny (co-author of the first edition of High Performance
MySQL) presented
(http://www.percona.com/live/mysql-conference-2012/sessions/living-sql-and-nosql-craigslist-pragmatic-approach)
the design of craigslist's site, which illustrates the model of
software accretion over time and an eager embrace of heterogeneity.
Among their components are:


  • Memcache, lying between the web servers and the MySQL database in
    classic fashion.

  • MySQL to serve live postings, handle abuse, hold data for the
    monitoring system, and meet other immediate needs.

  • MongoDB to store almost 3 billion items related to archived (no longer
    live) postings.

  • HAproxy to direct requests to the proper MySQL server in a cluster.

  • Sphinx for text searches, with
    indexes over all live postings, archived postings, and forums.

  • Redis for temporary items such as counters and blobs.

  • An XFS filesystem for images.

  • Other helper functions that Zawodny lumped together as "async
    services."


Care and feeding of this menagerie becomes a job all in itself.
Although craigslist hires enough developers to assign them to
different areas of expertise, they have also built an object layer
that understands MySQL, cache, Sphinx, and MongoDB. The original purpose
of this layer was to aid in migrating old data from MySQL to MongoDB
(a procedure Zawodny admitted was painful and time-consuming) but it
was extended into a useful framework that most developers can use
every day.

Zawodny praised MySQL's durability and its method of replication. But
he admitted that they used MySQL also because it was present when they
started and they were familiar with it. So adopting the newer entrants
into the data store arena was by no means done haphazardly or to try
out cool new tools. Each one precisely meets particular needs of the
site.


For instance, besides being fast and offering built-in sharding,
MongoDB was appealing because they don't have to run ALTER TABLE every
time they add a new field to the database. Old entries coexist happily
with newer ones that have different fields. Zawodny also likes using a
Perl client to interact with a database, and the Perl client provided
by MongoDB is unusually robust because it was developed by 10gen
directly, in contrast to many other datastores where Perl was added by
some random volunteer.

The architecture at craigslist was shrewdly chosen to match their
needs. For instance, because most visitors click on the limited set
of current listings, the Memcache layer handles the vast majority of
hits and the MySQL database has a relatively light load.


However, the MySQL deployment is also carefully designed. Clusters are
vertically partitioned in two nested ways. First, different types of
items are stored on separate partitions. Then, within each type, the
nodes are further divided by the type of query:

  • A single master to handle all writes.

  • A group for very fast reads (such as lookups on a primary key).

  • A group for "long reads" taking a few seconds.

  • A special node called a "thrash handler" for rare, very complex
    queries.

It's up to the application to indicate what kind of query it is
issuing, and HAproxy interprets this information to direct the query
to the proper set of nodes.
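
The routing idea in the preceding paragraphs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the host names, the QueryKind labels, and the route function are hypothetical, and picking hosts round-robin within a group is my assumption. At craigslist it is HAproxy, not application code, that maps the application's hint to a set of nodes.

```python
from enum import Enum
from itertools import cycle


class QueryKind(Enum):
    """Hint that the application attaches to each query."""
    WRITE = "write"            # all writes go to the single master
    FAST_READ = "fast_read"    # e.g. a lookup on a primary key
    LONG_READ = "long_read"    # reads taking a few seconds
    THRASH = "thrash"          # rare, very complex queries


# Hypothetical host names, one group per kind of query.
POOLS = {
    QueryKind.WRITE: ["master-1"],
    QueryKind.FAST_READ: ["fast-1", "fast-2", "fast-3"],
    QueryKind.LONG_READ: ["long-1", "long-2"],
    QueryKind.THRASH: ["thrash-1"],
}

# One rotating iterator per group (assumed round-robin policy).
_rotations = {kind: cycle(hosts) for kind, hosts in POOLS.items()}


def route(kind: QueryKind) -> str:
    """Return the next host that should serve a query of this kind."""
    return next(_rotations[kind])
```

The point of the design is that the expensive decision (which class of query is this?) is made once, by the code that knows the query best, so the proxy only has to look up a label.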

Naturally, redundancy is built in at every stage (three HAproxy
instances used in round robin, two Memcache instances holding the same
data, two data centers for the MongoDB archive, etc.).


It's also interesting what recent developments have been eschewed by
craigslist. They self-host everything and use no virtualization.
Zawodny admits this leads to an inefficient use of hardware, but it
avoids the overhead associated with virtualization. For efficiency,
they have switched to SSDs, allowing them to scale down from 20
servers to only 3. They don't use a CDN, finding that with aggressive
caching and good capacity planning they can handle the load
themselves. They send backups and logs to a SAN.

Let's turn now from the teeming environment of craigslist to the
decidedly lean operation of Pinterest, a much younger and smaller
organization. As presented by Marty Weiner and Yashh Nelapati
(http://www.percona.com/live/mysql-conference-2012/sessions/scaling-pinterest),
when they started web-scale
growth in the autumn of 2011, they reacted somewhat like craigslist,
but with much less thinking ahead, throwing in all sorts of software
such as Cassandra and MongoDB, and diversifying a bit recklessly.
Finally they came to their senses and went on a design diet. Their
resolution was to focus on MySQL--but the way they made it work is
unique to their data and application.

They decided against using a cluster, afraid that bad application code
could crash everything. Sharding is much simpler and doesn't require
much maintenance. Their advice for implementing MySQL sharding
included:

  • Make sure you have a stable schema, and don't add features for a
    couple months.


  • Remove all joins and complex queries for a while.

  • Do simple shards first, such as moving a huge table into its own
    database.

They use Pyres, a Python clone of Resque, to move data into shards.

However, sharding imposes severe constraints that led them to
hand-crafted work-arounds.

Many sites want to leave open the possibility for moving data between
shards. This is useful, for instance, if they shard along some
dimension such as age or country, and they suddenly experience a rush
of new people in their 60s or from China. The implementation of such a
plan requires a good deal of coding, described in the O'Reilly book MySQL High
Availability (http://shop.oreilly.com/product/9780596807290.do), including the creation of a service that just
accepts IDs and determines what shard currently contains the ID.

The Pinterest staff decided the ID service would introduce a single
point of failure, and decided just to hard-code a shard ID in every ID
assigned to a row. This means they never move data between shards,
although shards can be moved bodily to new nodes. I think this works
for Pinterest because they shard on arbitrary IDs and don't have a
need to rebalance shards.
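
To make the idea concrete, here is a minimal Python sketch of an ID that carries its own shard number. The field widths, function names, and host names are all assumptions chosen for illustration; the talk as reported here does not specify Pinterest's actual layout.

```python
SHARD_BITS = 16   # assumed width of the shard field
LOCAL_BITS = 36   # assumed width of the per-shard row counter


def make_id(shard_id: int, local_id: int) -> int:
    """Pack the shard number into the high bits of a new row ID."""
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    if not 0 <= local_id < (1 << LOCAL_BITS):
        raise ValueError("local_id out of range")
    return (shard_id << LOCAL_BITS) | local_id


def shard_of(object_id: int) -> int:
    """Recover the shard from the ID alone, with no lookup service."""
    return object_id >> LOCAL_BITS


def local_of(object_id: int) -> int:
    """Recover the row's ID within its shard."""
    return object_id & ((1 << LOCAL_BITS) - 1)


# Hypothetical placement of shards on hosts. Moving a whole shard to a
# new node only changes this table; the IDs themselves never change.
SHARD_HOSTS = [(range(0, 512), "db-a"), (range(512, 1024), "db-b")]


def host_for(object_id: int) -> str:
    """Find the host that currently holds the shard named in the ID."""
    shard = shard_of(object_id)
    for shards, host in SHARD_HOSTS:
        if shard in shards:
            return host
    raise LookupError(f"no host configured for shard {shard}")
```

Because the shard is derived from the ID by arithmetic, there is no ID service to fail. The price is the one noted above: a row can never migrate to a different shard without getting a new ID.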

Even more interesting is how they avoid joins. Suppose they want to
retrieve all pins associated with a certain board associated with a
certain user. In classical, normalized relational database practice,
they'd have to do a join on the board, pin, and user tables. But
Pinterest maintains extra mapping tables. One table maps users to
boards, while another maps boards to pins. They query the
user-to-board table to get the right board, query the board-to-pin
table to get the right pins, and then do simple queries without joins
on the tables with the real data. In a way, they implement a custom
NoSQL model on top of a relational database.
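
The pattern can be sketched as follows. This is a rough illustration, not Pinterest's schema: the table and column names are invented, and Python's built-in sqlite3 module stands in for MySQL so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Tables holding the real data.
    CREATE TABLE boards (board_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE pins (pin_id INTEGER PRIMARY KEY, caption TEXT);
    -- Mapping tables: they hold only IDs.
    CREATE TABLE user_has_boards (user_id INTEGER, board_id INTEGER);
    CREATE TABLE board_has_pins (board_id INTEGER, pin_id INTEGER);

    INSERT INTO boards VALUES (10, 'Recipes'), (11, 'Travel');
    INSERT INTO pins VALUES (100, 'Lasagna'), (101, 'Pho'), (102, 'Lisbon');
    INSERT INTO user_has_boards VALUES (1, 10), (1, 11);
    INSERT INTO board_has_pins VALUES (10, 100), (10, 101), (11, 102);
""")


def boards_of(user_id):
    """Step 1: single-table lookup from a user to board IDs."""
    rows = conn.execute(
        "SELECT board_id FROM user_has_boards WHERE user_id = ? ORDER BY board_id",
        (user_id,),
    )
    return [row[0] for row in rows]


def pins_on(board_id):
    """Step 2: single-table lookup from a board to pin IDs."""
    rows = conn.execute(
        "SELECT pin_id FROM board_has_pins WHERE board_id = ? ORDER BY pin_id",
        (board_id,),
    )
    return [row[0] for row in rows]


def load_pins(pin_ids):
    """Step 3: fetch the real data by primary key, still without a join."""
    if not pin_ids:
        return []
    marks = ",".join("?" * len(pin_ids))
    sql = f"SELECT pin_id, caption FROM pins WHERE pin_id IN ({marks}) ORDER BY pin_id"
    return conn.execute(sql, pin_ids).fetchall()
```

Each step touches exactly one table, so in a sharded deployment each query can be sent to whichever shard holds that table's rows. The application, rather than the database, stitches the results together.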

Pinterest does use Memcache and Redis in addition to MySQL. As with
craigslist, they find that most queries can be handled by Memcache.
And the actual images are stored in S3, an interesting choice for a
site that is already enormous.

It seems to me that the data and application design behind Pinterest
would have made it a good candidate for a non-ACID datastore. They
chose to stick with MySQL, but like organizations that use NoSQL
solutions, they relinquished key aspects of the relational way of
doing things. They made calculated trade-offs that worked for their
particular needs.

My take-away from these two fascinating and well-attended talks was
that you must understand your application, its scaling and
performance needs, and its data structure, to know what you can
sacrifice and what solution gives you your sweet spot. craigslist
solved its problem through the very precise application of different
tools, each with particular jobs that fulfilled craigslist's
requirements. Pinterest made its own calculations and found an
entirely different solution depending on some clever hand-coding
instead of off-the-shelf tools.

Current and future MySQL

The conference keynotes surveyed the state of MySQL and some
predictions about where it will go.

Conference co-chair Sarah Novotny at keynotes.

The world of MySQL is much more complicated than it was a couple years
ago, before Percona got heavily into the work of releasing patches to
InnoDB, before they created entirely new pieces of software, and
before Monty started MariaDB with the express goal of making a better
MySQL than MySQL. You can now choose among Oracle's official MySQL
releases, Percona's supported version, and MariaDB's supported
version. Because these are all open source, a major user such as
Facebook can even apply patches to get the newest features.

Nor are these different versions true forks, because Percona and
MariaDB create their enhancements as patches that they pass back to
Oracle, and Oracle is happy to include many of them in a later
release. I haven't even touched on the commercial ecosystem around
MySQL, which I'll look at later in this article.

In his opening keynote
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-mysql-evolution),
Percona founder Peter Zaitsev praised the latest MySQL
release by Oracle. With graceful balance he expressed pleasure that
the features most users need are in the open (community) edition, but
allowed that the proprietary extensions are useful too. In short, he
declared that MySQL is less buggy and has more features than ever.


The former CEO of MySQL AB, Mårten Mickos, in his own keynote
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-making-lamp-cloud),
also found that MySQL is
doing well under Oracle's wing. His one criticism was that Oracle fails
to work as well as it should with potential partners (by which I
assume he meant Percona and MariaDB). He lauded their community
managers but said the rest of the company should support them more.

Keynote by Mårten Mickos.

Brian Aker
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-new-mysql-cloud-ecosystem)
presented an OpenStack MySQL service developed by his current
employer, Hewlett-Packard. His keynote retold the story that had led
over the years to his developing Drizzle (https://launchpad.net/drizzle),
a true fork of MySQL
that tries to return it to its lightweight, Web-friendly roots, and
eventually working on cloud computing for HP. He described modularity,
effective use of multiple cores, and cloud deployment as the future of
databases.

A panel
(http://www.percona.com/live/mysql-conference-2012/sessions/future-perfect-road-ahead-mysql)
on the second day of the conference brought together senior
managers from many of the companies that have entered the MySQL space
from a variety of directions to discuss the
database engine's future. Like most panels, the conversation ranged
over a variety of topics--NoSQL, modular architecture, cloud
computing--but hit some depth only on the topic of security, which was
not represented very strongly at the conference and was discussed here
at the insistence of Slavik Markovich from McAfee.

Keynote by Brian Aker.


Many of the conference sessions disappointed me, being either very
high level (although presumably useful to people who are really new to
various topics, such as Hadoop or flash memory) or unvarnished
marketing pitches. I may have judged the latter too harshly though,
because a decent number of attendees came, and stayed to the end, and
crowded around the speakers for information.

Two talks, though, were so fast-paced and loaded with detail that I
couldn't possibly keep my typing up with the speaker.

One such talk was the keynote by Mark Callaghan of Facebook
(http://www.percona.com/live/mysql-conference-2012/sessions/keynote-what-comes-next).
(Like the other keynotes, it should be
posted online soon.) A smattering of points from it:


  • Percona and MariaDB are adding critical features that make replication
    and InnoDB work better.

  • When a logical backup runs, it is responsible for 50% of IOPS.

  • Defragmenting InnoDB improves compression.

  • Resharding is not worthwhile for a large, busy site (an insight also
    discovered by Pinterest, as I reported earlier).


The other fact-filled talk was by
Yoshinori Matsunobu of Facebook
(http://www.percona.com/live/mysql-conference-2012/sessions/using-nosql-innodb-memcached),
and concerned how to achieve
NoSQL-like speeds while sticking with MySQL and InnoDB. Much of the
talk discussed an InnoDB memcached plugin, which unfortunately is
still in the "lab" or "pre-alpha" stage. But he also suggested some
other ways to improve performance, some involving Memcache and others
more round-about:

  • Coding directly with the storage engine API, which is storage-engine
    independent.


  • Using HandlerSocket, which queues write requests and performs them
    through a single thread, avoiding costly fsync() calls. This can
    achieve 30,000 writes per second, robustly.
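
The general idea behind the second item, funneling queued writes through one thread so that many of them share a single fsync(), can be sketched in Python. This is not HandlerSocket's implementation or API; the BatchedWriter class and its method names are hypothetical, and the sketch writes to a plain append-only file rather than to InnoDB.

```python
import os
import queue
import threading


class BatchedWriter:
    """Funnel writes from many callers through one thread, syncing once per batch."""

    def __init__(self, path, max_batch=1000):
        self._file = open(path, "ab")
        self._queue = queue.Queue()
        self._max_batch = max_batch
        self.fsync_calls = 0  # counted only to show the savings

    def submit(self, record: bytes) -> None:
        """Called from any thread; only enqueues, never touches the disk."""
        self._queue.put(record)

    def flush_once(self) -> int:
        """Write whatever is queued (up to max_batch) with a single fsync."""
        batch = []
        while len(batch) < self._max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._file.write(b"".join(batch))
            self._file.flush()
            os.fsync(self._file.fileno())
            self.fsync_calls += 1
        return len(batch)

    def run(self, stop: threading.Event) -> None:
        """Body of the single writer thread: drain until told to stop."""
        while not stop.is_set() or not self._queue.empty():
            if self.flush_once() == 0:
                stop.wait(0.001)  # nothing queued; pause briefly

    def close(self) -> None:
        self._file.close()
```

In use, one thread would be started with threading.Thread(target=writer.run, args=(stop,)) while request handlers call submit(). The busier the server, the more records accumulate between flushes, so the cost of each fsync() is spread over more writes.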

Matsunobu claimed that many optimizations are available within MySQL
because a lot of data can fit in main memory. For instance, if you
have 10 million users and store 400 bytes per user, the raw data comes
to only about 4 GB, so the entire user
table can fit in 20 GB. Matsunobu's tests have shown that most CPU time
in MySQL is spent in functions that are not essential for processing
data, such as opening and closing a table. Each statement opens a
separate connection, which in turn requires opening and closing the
table again. Furthermore, a lot of data is sent over the wire besides
the specific fields requested by the client. The solutions in the talk
evade all this overhead.


The commercial ecosystem

Both as vendors and as sponsors, a number of companies have always
lent another dimension to the MySQL conference. Some of these really
have nothing to do with MySQL, but offer drop-in replacements for it.
Others fill a real niche for MySQL users. Here are a few that I
happened to talk to:

  • Clustrix provides a very
    different architecture for relational data. They handle sharding
    automatically, permitting such success stories as the massive scaling
    up of the social media site Massive Media NV without extra
    administrative work. Clustrix also claims to be more efficient by
    breaking queries into fragments (such as the WHERE clauses of joins)
    and executing them on different nodes, passing around only the data
    produced by each clause.

  • Akiban also offers faster
    execution through a radically different organization of data. They
    flatten the tables of a normalized database into a single
    data structure: for instance, a customer and his orders may be located
    sequentially in memory. This seems to me an import of the document
    store model into the relational model. Creating, in effect, an object
    that maps pretty closely to the objects used in the application
    program, Akiban allows common queries to be executed very quickly, and
    could be deployed as an adjunct to a MySQL database.

  • Tokutek produced a drop-in
    replacement for InnoDB. The founders developed a new data structure
    called a fractal tree as a faster alternative to the B-tree structures
    normally used for indexes. The existence of Tokutek vindicates both
    the open source distribution of MySQL and its unique modular design,
    because these allowed Tokutek's founders to do what they do
    best--create a new storage engine--without needing to create a whole
    database engine with the related tools and interfaces it would
    require.

  • Nimbus Data Systems creates a
    flash-based hardware appliance that can serve as a NAS or SAN to
    support MySQL. They support a large number of standard data transfer
    protocols, such as InfiniBand, and provide such optimizations as
    caching writes in DRAM and making sure they write complete 64KB blocks
    to flash, thus speeding up transfers as well as preserving the life of
    the flash.


Post-conference events

A low-key developer's day followed Percona Live on Friday. I talked to
people in the Drizzle and Sphinx tracks.

Because Drizzle is a relatively young project, its talks were aimed mostly at
developers interested in contributing. I heard talks about their
kewpie test framework and about build and release conventions. But in
keeping with its goal to make database use easy and light-weight, the
project has added some cool features.

Thanks to a JSON interface
and a built-in web server, Drizzle now presents you with a
Web interface for entering SQL commands. The Web interface translates
Drizzle's output to simple HTML tables for display, but you can also
capture the JSON directly, making programmatic access to Drizzle
easier. A developer explained to me that you can also store JSON
directly in Drizzle; it is simply stored as a single text column and
the JSON fields can be queried directly. This reminded me of an XQuery
interface added to some database years ago. There too, the XML was
simply stored as a text field and a new interface was added to run the
XQuery selects.

Sphinx, in contrast to Drizzle, is a mature product with commercial
support and (as mentioned earlier in the article) production
deployments at places such as craigslist, as well as an O'Reilly
book (http://shop.oreilly.com/product/9780596809539.do). I understood better, after attending today's sessions, what
makes Sphinx appealing. Its quality is unusually high, due to the use
of sophisticated ranking algorithms from the research literature. The
team is looking at recent research to incorporate even better
algorithms. It is also fast and scales well. Finally, integration with
MySQL is very clean, so it's easy to issue queries to Sphinx and pick
up results.

Recent enhancements include an add-on called fSphinx
(https://github.com/alexksikes/fSphinx)
to make faceted searches faster (through caching) and easier, and
access to Bayesian Sets to find "items similar to this one." In Sphinx
itself, the team is working to add high availability, include a new
morphology (stemming, etc.) engine that handles German, improve
compression, and make other enhancements.


The day ended with a reception and copious glasses of Monty Widenius's
notorious licorice-flavored vodka, an ending that distinguishes the
MySQL conference from others for all time.

February 15 2012

What the data can tell us about dating and other social congregation

Valentine's Day turned out to be a good time to discuss data crunching of online dating. Kevin Lewis, a PhD candidate in sociology and Berkman Center Fellow, drew an overflow room today for his talk Mate Choice in an Online Dating Site. It's yet another example of how, as people go online, they leave a trail of data that could never be captured before.

Here are some examples of how traditional researchers are restricted:

  • They can get marriage data, but have much less data about dating, cohabitation without marriage, and other non-traditional arrangements that are increasingly common. Dating sites let us in at a much earlier stage in a relationship that may or may not lead to marriage.

  • They can measure certain recorded demographics such as age and race, but miss a huge range of criteria by which people evaluate potential mates. People enter lots of interesting facts about themselves and their hoped-for mates on dating sites.

  • Because researchers miss the initial contacts, they have trouble tracing back from a result (marriage) to the criteria used by the dating couples.

As an example of the last problem, Lewis mentioned the observation that people usually date and marry others with similar levels of formal education. Actually, researchers have long hypothesized that men don't care much about women's educational levels. They would be willing to date and marry outside their educational levels. It's the women who care, and since they rule out men with much higher or lower educational levels, we end up with the current results.

Now Lewis can cite concrete data proving that hypothesis. On a dating site, men initiate and respond to contacts with women of many different levels. But the women don't initiate many contacts outside their own level, and don't respond to contacts from men outside that level.

How did Lewis conduct his research? Briefly, he persuaded OkCupid to give him a large data set stripped of free-text fields, but containing information on race, religion, and several other criteria. He chose data in the New York City area for heterosexual couples. Considering that 22% of heterosexual adults have found their current partners through online sites (the figure is even higher for same-sex couples: 61%), this is a lot of valuable data.

Of course, there are risks in extrapolating from this data set. Admittedly, OkCupid users tend to be younger and more Internet-savvy than the overall dating population. It's hard to tell whether some criterion is truly a determining factor or a consequence of some other factor (for instance, educational level is correlated with age). Still, Lewis controlled for variables a good deal and feels there is a lot of statistical validity to his findings.

As just one other example, he documented a lot of contacts across racial lines, more than one might expect. But there were definite patterns. For instance, black women received a lot fewer contacts from other races than most groups. In this way, the data on dating gives us a look at our values and choices in other forms of social interaction, not just romance.

December 10 2011

HealthTap's growth validates hypotheses about doctors and patients

A major round of funding for HealthTap gave me the opportunity to talk again with founder Ron Gutman, whom I interviewed earlier this year. You can get an overview of HealthTap from that posting or from its own web site. Essentially, HealthTap is a portal for doctors to offer information to patients and potential patients. In this digital age, HealthTap asks, why should a patient have to make an appointment and drive to the clinic just to find out whether her symptoms are probably caused by a recent medication? And why should a doctor repeat the same advice for each patient when the patient can go online for it?

Now, with 6,000 participating physicians and 500 participating health care institutions, HealthTap has revealed two interesting and perhaps unexpected traits about doctors:

  • Doctors will take the time to post information online for free. Many observers, including me in my earlier posting, questioned whether they'd take the time to do this. The benefits of posting information are that doctors can demonstrate their expertise, win new patients, and cut down on time spent answering minor questions.

  • Doctors are willing to rate each other. This can be a surprise in a field known for its reluctance to break ranks and doctors' famous unwillingness to testify in malpractice lawsuits. But doctors do make use of the "Agree" button that HealthTap provides to approve postings by other doctors. When they press this button, they add the approved posting to their own web page (Virtual Practice), thus offering useful information to their own patients and others who can find them through search engines and social networks. The "Agree" ratings also cause postings to turn up higher when patients search for information on HealthTap, and help create a “Trust Score” for the doctor.

HealthTap, Gutman assures me, is not meant to replace doctors' visits, although online chats and other services in the future may allow patients to consult with doctors online. The goals of HealthTap remain the routine provision of information that's easy for doctors to provide online, and making medicine more transparent so patients know their doctors, before treatment and throughout their relationships.

HealthTap has leapt to a new stage with substantial backing from Tim Chang (managing director of Mayfield Fund), Eric Schmidt (through his Innovation Endeavors) and Rowan Chapman (Mohr Davidow Ventures). These VCs provide HealthTap with the funds to bring on board the developers, as well as key product and business development hires, required to scale up its growing operations. These investors also lend the business the expertise of some of the leaders in the health IT industry.

November 01 2011

Demoting Halder: A wild look at social tracking and sentiment analysis

I've been holding conversations with friends around a short story I put up on the Web last week, Demoting Halder, and interesting reactions have come up. Originally, the story was supposed to lay out an alternative reality where social tracking and sentiment analysis had taken over society so pervasively that everything people did revolved around them. As the story evolved, I started to wonder whether the reality in the story was an alternative one or something we are living right now. I think this is why people have been responding to the story.

The old saying, "First impressions are important" is going out of date. True, someone may form a lasting opinion of you based on the first information he or she hears, but you no longer have control over what this first information is. Businesses go to great lengths to influence what tops the results in Google and other search engines. There are court battles over ads that are delivered when people search for product names--it's still unclear whether a company can be successfully sued for buying an ad for a name trademarked by a competitor. But after all this effort, someone may hear about you first on some forum you don't even know about.

In short, by the time people call you or send email, you have no idea what they know already and what they think about you. One friend told me, "Social networking turns the whole world into one big high school (and I didn't like high school)." Nearly two years ago, I covered questions of identity online, with a look at the effects of social networking, in a series on Radar. I think it's still relevant, particularly concerning the choices it raised about how to behave on social networks, what to share, and--perhaps most importantly--how much to trust what you see about other people.

Some people assiduously monitor what comes up when they Google their name or how many followers they have on various social networks. Businesses are springing up that promise even more sophisticated ways to rank people or organizations. Some of the background checking shades over into outright stalking, where an enemy digs up obscure facts that seem damaging and posts them to a forum where they can influence people's opinion of the victim. One person who volunteered for a town commission got on the wrong side of somebody who came before the commission, and had to cope with such retaliation as having pictures of her house posted online along with nasty comments. I won't mention what she found out when she turned the tables and looked the attacker up online. After hearing her real-life experiences, I felt like my invented story will soon be treated as a documentary.

And the success that characters have in gaming the system in Demoting Halder should be readily believable. Today we depend heavily on ratings even though there are scads of scams on auction sites, people using link farms and sophisticated spam-like schemes to boost search results, and skewed ratings on travel sites and similar commercial ventures.

One friend reports, "It is amazing how many people have checked me and my company out before getting on the initial call." Tellingly, she goes on to admit, "Of course, I do the same. It used to be that was rare behavior. Now it is expected that you will have this strange conversation where both parties know way too much about each other." I'm interested in hearing more reactions to the story.

April 22 2011

HealthTap explores how big a community you need to crowdsource health information

A new company named HealthTap has just put together an intriguing combination of crowdsourced health advice and community-building. They rest their business case on the proposition that a battalion of doctors and other health experts--starting at the launch with 580 obstetricians and pediatricians--can provide enough information to help people make intelligent decisions. For me, although the venture is worthy in itself, it offers a model of something that might be even better as a national or international effort.

I had a chance just before the launch to visit the HealthTap office and get a pitch from the boisterous and enthusiastic Ron Gutman. The goal of HealthTap is to help ordinary people--in its first project, pregnant women and new mothers--answer basic questions such as, "What possible conditions match this symptom?" or "What should I do about this problem with a baby?" They don't actually provide health care online, but they provide information directly from doctors on the health issues that concern individuals.

The basic goals of HealthTap's "medical expert network" are:

  • To improve the public's understanding of their own health needs by providing precise, targeted information.

  • To engage patients, making them more interested in taking care of themselves.

  • To allow doctors to share their knowledge with communities, including current and prospective patients, by publicizing answers and tips they often share with patients.

  • To increase the efficiency and effectiveness of healthcare by helping users record personal information while researching their concerns before doctor visits.

  • To promote good doctors and create networks around caring.

HealthTap needs to bring two populations online to succeed: health care providers and individuals (which I will try to avoid calling patients, because individuals can use health information without suffering from a specific complaint). I'll explain how they attract each population and what they offer these populations. Then I'll explore three challenges suggesting that HealthTap is best seen as a model for a national program: motivation, outreach, and accuracy.

Signing up health care providers

As I mentioned, HealthTap is demonstrating its viability already, boasting of 580 obstetricians and pediatricians. These doctors answer questions from patients, creating a searchable database. (And as we'll see in the next section, the customization for individuals goes far beyond search results.) Doctors can also write up tips for individuals. If a doctor finds she is handing out the same sheet to dozens of patients, she might as well put it online where she can refer her own patients and others can also benefit.

Doctors can be followed by patients and other doctors, an application of classic social networking. Doctors can also recommend other doctors. As with cheat sheets, doctor recommendations reflect ordinary real-life activities. Each doctor has certain specialists she recommends, and by doing it on HealthTap she can raise the general rating of these specialists.

Signing up individuals

Anyone can get a HealthTap account. Although you don't need to answer many questions, the power of customization provides an incentive to fill in as much information about yourself as you can. The pay-off is that when you search for a symptom or other issue--for instance, "moderate fever"--you will get results tailored to your medical condition, your age, the state of your pregnancy, and so on. A demo I saw at the HealthTap office suggested that the information brought up by a search there is impressively relevant, unlike a typical search for symptoms on Google or Bing.

HealthTap then goes on to ask a few relevant questions to help refine the information it provides even further. For a fever, it may ask what your temperature is and whether you feel pain anywhere. As explained earlier, it doesn't end up giving medical advice, but it does list the percentage match between symptoms reported and a list of possible conditions in its comprehensive database. All these heuristics--the questions, the list of conditions, the probabilities, the other suggestions--derive from the information entered by the providers in the HealthTap network and data published in peer-reviewed medical journals.

Some natural language processing is supported, letting HealthTap interpret questions such as "Should I eat fish?" Gutman ascribed their capabilities to a comprehensive medical knowledgebase built around a medical ontology that is designed especially for use by the lay public, coupled with a Bayesian (probabilistic) reasoning engine driving user interactions that lead to normative value-based choices.
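To make the idea of a probabilistic "percentage match" concrete, here is a minimal sketch of naive Bayesian ranking in Python. It is not HealthTap's engine, whose internals were not described to me in detail. The condition names, priors, and likelihoods are invented purely for illustration and have no medical validity, and the function name `rank_conditions` is hypothetical.

```python
from math import prod

# Hypothetical prior probability of each condition (made-up numbers).
PRIORS = {"common cold": 0.6, "flu": 0.3, "ear infection": 0.1}

# Hypothetical P(symptom | condition), also made up for illustration only.
LIKELIHOODS = {
    "common cold": {"fever": 0.2, "cough": 0.8, "ear pain": 0.05},
    "flu": {"fever": 0.9, "cough": 0.7, "ear pain": 0.05},
    "ear infection": {"fever": 0.5, "cough": 0.1, "ear pain": 0.9},
}


def rank_conditions(symptoms):
    """Return (condition, posterior) pairs, highest posterior first.

    Applies Bayes' rule under the naive assumption that symptoms are
    independent of one another given the condition.
    """
    scores = {
        cond: PRIORS[cond] * prod(LIKELIHOODS[cond][s] for s in symptoms)
        for cond in PRIORS
    }
    total = sum(scores.values())  # normalizing constant
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(cond, score / total) for cond, score in ranked]


if __name__ == "__main__":
    for cond, p in rank_conditions(["fever", "cough"]):
        print(f"{cond}: {p:.0%}")
```

Each follow-up question a service asks would, in this toy model, simply add one more symptom to the list and shift the ranking.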

HealthTap lets individuals follow the doctors whose advice they find helpful, and connect to individuals with medical conditions like theirs. Thus, HealthTap incorporates two of the key traits of social networks: forming communities and raising the visibility of doctors who provide useful information.


HealthTap uses easy-to-understand data visualization techniques that distill medical content into simple visual elements that intuitively communicate data that otherwise could seem complex (such as the symptoms, risk factors, tests, and treatments related to a condition).

So now let's look at the challenges HealthTap faces.

I'll mention at the outset that HealthTap doesn't have as much of a problem of trust as most social networks. Doctors have passed a high threshold and start out with the expectation of competence and commitment to their patients. This isn't always borne out by individual doctors, of course, but if you put 580 of them together you'll end up with a good representation of the current state of medical knowledge.

The challenge of motivation

Time will tell whether HealthTap can continue to attract physicians. Every posting by a physician is an advertising opportunity, and the chances of winning new patients--which can be done both by offering insights and by having colleagues recommend you--may motivate doctors to join and stay active. But I suspect that the most competent and effective doctors already have more patients banging down the door than they can handle, so they might stay away from HealthTap. Again, I like the HealthTap model but wish it could be implemented on a more universal scale.

The challenge of outreach

Pregnancy and childbirth was a clever choice for HealthTap's launch. Each trimester--not to mention the arrival of a newborn--feels like a different phase of life, and parents are always checking other people for insights. Still, the people who sign up for HealthTap are the familiar empowered patients, the ones with time and education, those who take the effort to take care of themselves and are likely to be big consumers of information in other formats such as Wikipedia (which has quite high-quality health information), books, courses, and various experts in fields related to birth. HealthTap will have more impact on health if doctors can persuade other people to sign up--those who don't eat right, who consume dangerous stimulants, etc.

And if HealthTap or something like it were a national program--if everybody was automatically signed up for a HealthTap account--we might have an impact on people who desperately need a stronger connection with their doctor and encouragement to lead healthier lives. The psychiatric patients who go off their meds, the diabetics with bad diets, the people with infections who fail to return for health checks--these could really benefit from the intensive online community HealthTap provides.

The challenge of accuracy

Considering that most people depend on the advice of one or two doctors for each condition treated, the combination of insights from 580 doctors should improve accuracy a lot. Still, clinical decision engines are a complex field. It seems to me that a service like HealthTap would be much more trustworthy if it represented a national crowdsourcing effort. Every doctor could benefit from a clinical decision support system, and the current US health care reform includes the use of such systems. I'd like to see a large one that HealthTap could tap into.

Gutman says that HealthTap has indeed started to work with other institutions and innovators. "We have worked with the U.S. Department of Health and Human Services to help make the public health data they released easy to access and useful for developers, and last year we held the first-ever 'health hackathon,' where hundreds gathered to build apps using the newly released health data we organized. We can become a platform for a world of health apps tied to our growing, open data, created by leading physicians and validated research."

April 01 2011

Open Media Boston forum examines revolution and Internet use in Middle East

What is the role of social media, and the Internet generally, in the
current Arab revolutions? Barrels of ink have been spilled over this
question (more than I have read, admittedly) but one could get the
real low-down from a forum (http://www.openmediaboston.org/node/1721) I attended
last night at Lesley University.
Two of the presenters--Ethan Zuckerman and Jillian York--are connected
to the Berkman Center at Harvard, just a few blocks down the street,
while the third, Suren Moodliar, works with two political advocacy
groups, Massachusetts Global Action and the Majority Agenda Project.
The forum was put together by Open Media Boston (http://openmediaboston.org/).

I came away convinced that Internet sites--Facebook in
particular--were crucial to the spread of the revolutions. In the long
run, a larger role was played by the traditional media (if one can
call Al Jazeera traditional, in the sense that they're a professional
broadcast organization), but they would never have gotten their hands
on the story of the Tunisian protests had not a cousin of the renowned
Mohamed Bouazizi released videos of protests on Facebook. According to
Zuckerman, this is what set Bouazizi's horrendous suicide apart from
similar acts of protest that had dotted Tunisia before.

Many worthy questions were raised and debated--and sometimes even
answered--during the two hours we occupied the hall, but one of the
most interesting concerned the most essential political question
facing the Internet: does it favor the grass-roots or the centers of
social control in government and business?

Many take the position extensively written on by Evgeny Morozov, that
the governments and other central powers will use the Internet for
surveillance and control. York underscored the dangers here by
reporting that Syrian officials have arrested protestors and forced
them to give up their passwords on popular social networks. In this
way the government can find out all their contacts and what they're
saying to them--even use their accounts to post pro-government
propaganda. One can extrapolate from this activity to a possible
future of an Internet panopticon where everything you want to say to
anyone must use the Internet.

York also reported the ironic backfiring (corroborated by an audience
member) of Egypt's famed Internet shutdown. By the time of the
shutdown, huge numbers of Egyptians were following the protestors in
Tahrir Square, comfortable with learning what was going on from their
photos and postings, perhaps even writing blogs from the comfort of
their homes or cafes. When the Internet went down (mostly--one out of
Egypt's eight Internet providers refused the order to shut down, and
it was a small local company, not one of the international biggies),
many of the people who wanted to follow events remotely decided to go
to Tahrir Square, just to keep their connection with the historic
events of the day. No wonder Zuckerman spoke of a "dictator's dilemma"
in deciding how much to censor the Internet.

Other issues included the double-edged sword of U.S. government
support for alternative media, the participation of women in the
revolutions both online and off, the relationship between Internet
penetration and the potential for mass mobilization, and the relative
importance of Internet users--mostly middle-class and educated--among
the much larger and sometimes militant population.

Certainly, non-violent revolutions have occurred prior to widespread
Internet access. Comparisons are often made between the current Arab
events and the fall of Communist regimes in Eastern Europe and the
Soviet Union--political events heavily intertwined with media
activity, but not much of the Internet kind--and I thought also of the
massive shift in a mostly Internet-free Latin America over the past
thirty years toward democracy, indigenous people's power, and civil
liberties. Revolutions are messy, and observers can argue about
causes and contributing factors for centuries. When you're in the
middle of one, you use any tools you can get your hands on.

March 02 2011

Media old and new are mobilized for effective causes

The bright light of social media has attracted the attention of followers in every discipline, from media and academia to corporate marketing and social causes. There was something for everybody today in the talk by researcher Sasha Costanza-Chock at Harvard's Berkman Center on Transmedia Mobilization. He began with a stern admonition to treat conventional and broadcast media as critical resources, and moved ultimately to a warning to treat social networks and other peer-driven media as critical resources. I hope I can reproduce at least a smidgen of the insights and research findings he squeezed into a forty-five minute talk (itself a compressed version of a much longer presentation he has delivered elsewhere).

Control the message, control the funding

Consultants (not normally known for welcoming a wide range of outside opinions themselves) have been browbeating the corporations and governments over social media, trying to get it through their noggins that Twitter and Facebook are not merely a new set of media outlets to fill with their PR and propaganda. Corporations and governments are notoriously terrified of losing control over "the message," but that is the only way they will ever get a message out in the new world of peer-to-peer communications. According to Costanza-Chock, non-profit social causes suffer from the same skittishness. "Some of them will go to the bitter end trying to maintain top-down control," he laughed. But "the smart ones" will get the point and become equal participants in forums that they don't try to dominate.

Another cogent observation, developed further in his discussion with the audience, drew a line between messaging and funding. Non-profits depend on foundations and other large donors, and need to demonstrate to them that the non-profit has actually done something. "We exchanged messages with 100,000 participants on MySpace" comes in sounding worth a lot less than, "We shot three documentaries and distributed press releases to four hundred media outlets." Sasha would like to see forums on social media for funders as well as the non-profits they fund. All sides need to learn the value of being peers in a distributed system, and how to use that role effectively.

Is the Internet necessary?

Sasha's key research for this talk involved the 2006 pro-immigrant demonstrations that played a role in bringing down the Sensenbrenner Bill that would have imposed severe restrictions on citizens in an attempt to marginalize and restrict the movements of immigrants. The protests filled the streets of major cities across the country, producing the largest demonstrations ever seen on U.S. soil. How did media play a role?

Sasha started by seemingly downplaying the importance of Internet-based media. He went over recent statistics about computer use, broadband access, and cell phone ownership among different demographics and showed that lower-class, Spanish-speaking residents (the base of the protestors in Los Angeles, where he carried out his research) were woefully under-represented. It would appear that the Internet was not a major factor in the largest demonstrations in U.S. history. But he found that Internet-based media played a subtle role in combination with traditional media.

Immigrants are also largely shut out of mainstream media; it's a red-letter day when a piece about their lives or written from their point of view appears on page 10 of the local paper. Most of the mobilization, therefore, Sasha attributed to Spanish talk radio, which Los Angeles immigrants turned on all day and whose hosts made a conscious decision to work together and promote social action around the Sensenbrenner Bill.

Sasha also discovered other elements of traditional media, such as a documentary movie about Latino protests in 1960s Los Angeles that aired shortly before the demonstrations. And here's where social media came in: high school students who played roles in the documentary posted clips of their parts on MySpace. There were other creative uses of YouTube and the social media sites to spread the word about protests. Therefore, the Internet can't be dismissed. It could not have done much without material from traditional media to draw on, but the traditional media would not have had such a powerful effect without the Internet either.

One interesting aspect of Sasha's research concerned identity formation. You can't join with people in a cause unless you view the group as part of your identity, and traditional media go a long way to helping people form their identities. Just by helping to make a video, you can start to identify with a cause. It's interesting how revolutionaries in countries such as Tunisia and Egypt formed identities as a nation in opposition to their leaders instead of (as most dictators strive to achieve) in sympathy with them. So identity formation is a critical process, and we don't know yet how much social networks can do to further it.

In conclusion, it seems that old and new media will co-exist for an indefinite period of time, and will reinforce each other. Interesting questions were raised in the audience about whether the new environment can create meeting spaces where people on opposing sides can converse productively, or whether it will be restricted to the heavily emotion-laden and one-sided narratives that we see so much of nowadays. One can't control the message, but the message can sure be powerful.

November 16 2010

Active Facebook users by region: November, 2010

With Facebook unveiling an integrated messaging system for its more than 500 million users, I decided to update a few charts that break down its users by region.

I. Percentage share of active users (weekly): note the steady rise in the share of users from Asia.

II. Market Penetration: Less than 3% penetration in Facebook's high-growth regions in Asia & Africa.

III. Percent Share of each Age Group (within each Region): Relative to the U.S. (33%), the share of users ages 18-25 continues to be higher in Asia (44%), Africa (41%), and the Middle East / N. Africa (39%).

November 09 2010

July 29 2010

Which Social Gaming companies are Hiring

Disney's announced purchase of Mountain View gaming startup Playdom follows on the heels of EA's purchase of London-based Playfish last November. Based on active users, Zynga remains by far the biggest online social gaming company, but what other independent companies are growing?


To see which companies are expanding, I used our data warehouse of online job postings (1) to detect recent hiring (2). Zynga and Playdom put out the most job postings over the last three months, with (Redwood City startup) Watercooler finishing a distant third (3):



While I focused on online social gaming companies, I checked to see which companies were showing interest in games for smartphones, and found not too many were mentioning the iPhone or Android platforms on their job posts. Outside of Zynga, Playdom and Popcap Games, none of the other companies had (many) job postings that mentioned the iPhone/iPad or Android platforms (4):


(1) Data for this post is for U.S. online job postings through 7/25/2010 and is maintained in partnership with SimplyHired.com. We use algorithms to dedup job posts: a single job posting can contain multiple jobs and appear on multiple job sites.

(2) Online job postings are from thousands of sources, and there are no standardized data formats (e.g., a field for company name). I quickly normalized company names for this post, but the results remain best approximations.

(3) Our data is for U.S. online job postings, so does not reflect hiring for overseas subsidiaries (e.g., Playfish/EA is based in London). Moreover, we did not include social gaming companies based outside the U.S. In the Facebook ecosystem, some of the top gaming companies have headquarters in East Asia and South America.

(4) iPhone does seem to be the (smartphone) platform of choice for these companies. Of the Jan-to-Jul 2010 job posts placed by the companies listed above, 23% mentioned the iPhone/iPad and only 2% mentioned the Android.

July 21 2010

Where Facebook's half a billion users reside

Facebook announced this morning that they now reach 500 million active users (just five and a half years after launching). But where do these half a billion users reside? Refreshing my post from February, the share of users from Asia continues to rise and now stands at 17% of all Facebook users (1).


Africa is the other fast-growth region and I'm expecting the region's share of active Facebook users to rise sharply over the next year. In terms of market potential, the number of active users in Asia is 2.3% of the population (1% in Africa) so the company still has lots of growth potential in the region:

The share of users age 18-25 remains higher in regions outside the U.S., especially in Asia, the Middle East / North Africa, Africa, and South America. 14% of users in the U.S. are 55 or older; the corresponding figure in Asia is 2% (in Europe and Africa it's 6%, in the M. East/N. Africa it's 3%, in South America it's 4%).

Over the past 12 weeks, Facebook added over a million active users in fourteen countries, including 5 in Asia and all three members of NAFTA:

I close with a list of countries where Facebook is growing the fastest (2):

(1) Data for this post is through the week ending 7/18/2010 and covers the 180+ countries where Facebook has a presence.

(2) Measured in terms of percentage change in active users over the last 12 weeks.

March 29 2010

The State of the Internet Operating System

I've been talking for years about "the internet operating system", but I realized I've never written an extended post to define what I think it is, where it is going, and the choices we face. This is that missing post. Here you will see the underlying beliefs about the future that are guiding my publishing program as well as the rationale behind conferences I organize like the Web 2.0 Summit and Web 2.0 Expo, the Where 2.0 Conference, and even the Gov 2.0 Summit and Gov 2.0 Expo.


Ask yourself for a moment, what is the operating system of a Google or Bing search? What is the operating system of a mobile phone call? What is the operating system of maps and directions on your phone? What is the operating system of a tweet?


On a standalone computer, operating systems like Windows, Mac OS X, and Linux manage the machine's resources, making it possible for applications to focus on the job they do for the user. But many of the activities that are most important to us today take place in a mysterious space between individual machines. Most people take for granted that these things just work, and complain when the daily miracle of instantaneous communications and access to information breaks down for even a moment.


But peel back the covers and remember that there is an enormous, worldwide technical infrastructure that is enabling the always-on future that we rush thoughtlessly towards.


When you type a search query into Google, the resources on your local computer - the keyboard where you type your query, the screen that displays the results, the networking hardware and software that connects your computer to the network, the browser that formats and forwards your request to Google's servers - play only a small role. What's more, they don't really matter much to the operation of the search - you can type your search terms into a browser on a Windows, Mac, or Linux machine, or into a smartphone running Symbian, or PalmOS, the Mac OS, Android, Windows Mobile, or some other phone operating system.


The resources that are critical to this operation are mostly somewhere else: in Google's massive server farms, where proprietary Google software farms out your request (one of millions of simultaneous requests) to some subset of Google's servers, where proprietary Google software processes a massive index to return your results in milliseconds.


Then there's the IP routing software on each system between you and Google's data center (you didn't think you were directly connected to Google, did you?), the majority of it running on Cisco equipment; the mostly open source Domain Name System, a network of lookup servers that not only allows your computer to connect to google.com in the first place (rather than typing an IP address like 74.125.19.106), but also steps in to help your computer access whatever system out there across the net holds the web pages you are ultimately looking for; the protocols of the web itself, which allow browsers on client computers running any local operating system (perhaps we'd better call it a bag of device drivers) to connect to servers running any other operating system.


You might argue that Google search is just an application that happens to run on a massive computing cluster, and that at bottom, Linux is still the operating system of that cluster. And that the internet and web stacks are simply a software layer implemented by both your local computer and remote applications like Google.


But wait. It gets more interesting. Now consider doing that Google search on your phone, using Google's voice search capability. You speak into your phone, and Google's speech recognition service translates the sound of your voice into text, and passes that text on to the search engine - or, on an Android phone, to any other application that chooses to listen. Someone familiar with speech recognition on the PC might think that the translation is happening on the phone, but no, once again, it's happening on Google's servers. But wait. There's more. Google improves the accuracy of its speech recognition by comparing what the speech algorithms think you said with what its search system (think "Google suggest") expects you were most likely to say. Then, because your phone knows where you are, Google filters the results to find those most relevant to your location.
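The idea of checking what the recognizer heard against what people are likely to search for can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a description of Google's implementation: the function name `rescore`, the candidate phrases, and all the probabilities are hypothetical.

```python
def rescore(candidates, query_prior, floor=1e-6):
    """Pick the transcription with the best combined score.

    candidates:  dict mapping candidate text -> acoustic probability
                 (how well the text matches the sound).
    query_prior: dict mapping text -> relative frequency among past
                 queries (how likely anyone is to search for it).
    floor:       small prior assumed for never-before-seen queries.
    """
    def combined(text):
        return candidates[text] * query_prior.get(text, floor)

    return max(candidates, key=combined)


if __name__ == "__main__":
    # Made-up numbers: the acoustic model slightly prefers the wrong phrase.
    heard = {"wreck a nice beach": 0.55, "recognize speech": 0.45}
    popularity = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}
    print(rescore(heard, popularity))
```

In this toy case the query statistics overrule the acoustic model, which is the effect described above: the data subsystem, not the microphone, does the heavy lifting.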


Your phone knows where you are. How does it do that? "It's got a GPS receiver," is the facile answer. But if it has a GPS receiver, that means your phone is getting its position information by reaching out to a network of satellites originally put up by the US military. It may also be getting additional information from your mobile carrier that speeds up the GPS location detection. It may instead be using "cell tower triangulation" to measure your distance from the nearest cellular network towers, or even doing a lookup from a database that maps Wi-Fi hotspots to GPS coordinates. (These databases have been created by driving every street and noting the location and strength of every Wi-Fi signal.) The iPhone relies on the Skyhook Wireless service to perform these lookups; Google has its own equivalent, doubtless created at the same time as it created the imagery for Google Streetview.
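A hotspot database lookup of this kind can be sketched as a weighted centroid: take the known coordinates of every hotspot the phone can hear and average them, giving stronger signals more weight. The Python below is a simplified illustration, not how Skyhook or Google actually compute positions; the database contents, the identifiers, and the function name `estimate_position` are all hypothetical.

```python
# Hypothetical database: access point identifier -> (latitude, longitude),
# as might be collected by driving the streets.
HOTSPOT_DB = {
    "aa:aa:aa:aa:aa:01": (37.0000, -122.0000),
    "aa:aa:aa:aa:aa:02": (37.0010, -122.0000),
    "aa:aa:aa:aa:aa:03": (37.0000, -122.0010),
}


def estimate_position(observations, db=HOTSPOT_DB):
    """Estimate (lat, lon) from the hotspots currently in range.

    observations: dict mapping access point identifier -> positive
                  weight, where a larger weight means a stronger signal.
    Returns the weighted centroid of the known hotspots, or None if
    none of the observed hotspots is in the database.
    """
    known = {ap: w for ap, w in observations.items() if ap in db and w > 0}
    if not known:
        return None
    total = sum(known.values())
    lat = sum(db[ap][0] * w for ap, w in known.items()) / total
    lon = sum(db[ap][1] * w for ap, w in known.items()) / total
    return lat, lon


if __name__ == "__main__":
    print(estimate_position({"aa:aa:aa:aa:aa:01": 3.0,
                             "aa:aa:aa:aa:aa:02": 1.0}))
```

Averaging raw latitude and longitude is only reasonable over the short distances a Wi-Fi signal covers, which is the situation here.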


But whichever technique is being used, the application is relying on network-available facilities, not just features of your phone itself. And increasingly, it's hard to claim that all of these intertwined features are simply an application, even when they are provided by a single company, like Google.


Keep following the plot. What mobile app (other than casual games) exists solely on the phone? Virtually every application is a network application, relying on remote services to perform its function.


Where is the "operating system" in all this? Clearly, it is still evolving. Applications use a hodgepodge of services from multiple different providers to get the information they need.


But how different is this from PC application development in the early 1980s, when every application provider wrote their own device drivers to support the hodgepodge of disks, ports, keyboards, and screens that comprised the still emerging personal computer ecosystem? Along came Microsoft with an offer that was difficult to refuse: We'll manage the drivers; all application developers have to do is write software that uses the Win32 APIs, and all of the complexity will be abstracted away.


It was. Few developers write device drivers any more. That is left to device manufacturers, with all the messiness hidden by "operating system vendors" who manage the updates and often provide generic APIs for entire classes of device. Those vendors who took on the pain of managing complexity ended up with a powerful lock-in. They created the context in which applications have worked ever since.


This is the crux of my argument about the internet operating system. We are once again approaching the point at which the Faustian bargain will be made: simply use our facilities, and the complexity will go away. And much as happened during the 1980s, there is more than one company making that promise. We're entering a modern version of "the Great Game", the rivalry to control the narrow passes to the promised future of computing. (John Battelle calls them "points of control".) This rivalry is seen most acutely in mobile applications that rely on internet services as back-ends. As Nick Bilton of the New York Times described it in a recent article comparing the Google Nexus One and the iPhone:


Chad Dickerson, chief technology officer of Etsy, received a pre-launch Nexus One from Google three weeks ago. He says Google's phone feels connected to certain services on the Web in a way the iPhone doesn't. "Compared to the iPhone, the Google phone feels like it's part of the Internet to me," he said. "If you live in a Google world, you have that world in your pocket in a way that's cleaner and more connected than the iPhone."


The same thing applies to the iPhone. If you're a MobileMe, iPhoto, iTunes or Safari user, the iPhone connects effortlessly to your pictures, contacts, bookmarks and music. But if you use other services, you sometimes need to find software workarounds to get access to your content.


In comparison, with the Nexus One, if you use GMail, Google Calendar or Picasa, Google's online photo storage software, the phone connects effortlessly to these services and automatically syncs with a single log-in on the phone.


The phones work perfectly with their respective software, but both of them don't make an effort to play nice with other services.



Never mind the technical details of whether the Internet really has an operating system or not. It's clear that in mobile, we're being presented with a choice of platforms that goes far beyond the operating system on the handheld device itself.


With that preamble, let's take a look at the state of the Internet Operating System - or rather, competing Internet Operating Systems - as they exist today.

The Internet Operating System is an Information Operating System

Among many other functions, a traditional operating system coordinates access by applications to the underlying resources of the machine - things like the CPU, memory, disk storage, keyboard and screen. The operating system kernel schedules processes, allocates memory, manages interrupts from devices, handles exceptions, and generally makes it possible for multiple applications to share the same hardware.

As a result, it's easy to jump to the conclusion that "cloud computing" platforms like Amazon Web Services, Google App Engine, or Microsoft Azure, which provide developers with access to storage and computation, are the heart of the emerging Internet Operating System.

Cloud infrastructure services are indeed important, but to focus on them is to make the same mistake as Lotus did when it bet on DOS remaining the operating system standard rather than the new GUI-based interfaces. After all, Graphical User Interfaces weren't part of the "real" operating system, but just another application-level construct. But even though for years, Windows was just a thin shell over DOS, Microsoft understood that moving developers to higher levels of abstraction was the key to making applications easier to use.

But what are these higher levels of abstraction? Are they just features that hide the details of virtual machines in the cloud, insulating the developer from managing scaling or hiding details of 1990s-era operating system instances in cloud virtual machines?

The underlying services accessed by applications today are not just device components and operating system features, but data subsystems: locations, social networks, indexes of web sites, speech recognition, image recognition, automated translation. It's easy to think that it's the sensors in your device - the touch screen, the microphone, the GPS, the magnetometer, the accelerometer - that are enabling their cool new functionality. But really, these sensors are just inputs to massive data subsystems living in the cloud.

When, for example, as an iPhone developer, you use the iPhone's Core Location Framework to establish the phone's location, you aren't just querying the sensor, you're doing a cloud data lookup against the results, transforming GPS coordinates into street addresses, or perhaps transforming WiFi signal strength into GPS coordinates, and then into street addresses. When the Amazon app or Google Goggles scans a barcode, or the cover of a book, it isn't just using the camera with onboard image processing, it's passing the image to much more powerful image processing in the cloud, and then doing a database lookup on the results.

Increasingly, application developers don't do low-level image recognition, speech recognition, location lookup, social network management and friend connect. They place high level function calls to data-rich platforms that provide these services.

With that in mind, let's consider what new subsystems a "modern" Internet Operating System might contain:

Search

Because the volume of data to be managed is so large, because it is constantly changing, and because it is distributed across millions of networked systems, search proved to be the first great challenge of the Internet OS era. Cracking the search problem requires massive, ongoing crawling of the network, the construction of massive indexes, and complex algorithmic retrieval schemes to find the most appropriate results for a user query. Because of the complexity, only a few vendors have succeeded with web search, most notably Google and Microsoft. Yahoo! and Amazon too built substantial web search capabilities, but have largely left the field to the two market leaders.

However, not all search is as complex as web search. For example, an e-commerce site like Amazon doesn't need to constantly crawl other sites to discover their products; it has a more constrained retrieval problem of finding only web pages that it manages itself. Nonetheless, search is fractal, and search infrastructure is replicated again and again at many levels across the internet. This suggests that there are future opportunities in harnessing distributed, specialized search engines to do more complete crawls than can be done by any single centralized player. For example, Amazon harnesses data visible only to them, such as the rate of sales, as well as data they publish, such as the number and value of customer reviews, in ranking the most popular products.

In addition to web search, there are many specialized types of media search. For example, any time you put a music CD into an internet-connected drive, it immediately looks up the track names in CDDB using a kind of fingerprint produced by the length and sequence of each of the tracks on the CD. Other types of music search, like the one used by cell phone applications like Shazam, look up songs by matching their actual acoustic fingerprint. Meanwhile, Pandora's "music genome project" finds similar songs via a complex of hundreds of different factors as analyzed by professional musicians.
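The CD lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea of fingerprinting a disc by the length and sequence of its tracks. The function name `disc_fingerprint`, the packing of the checksum, and the in-memory `catalog` are assumptions made for the example, and the result should not be expected to match the identifiers used by the real CDDB service.

```python
def digit_sum(n: int) -> int:
    """Sum the decimal digits of a non-negative integer."""
    total = 0
    while n > 0:
        total += n % 10
        n //= 10
    return total


def disc_fingerprint(track_lengths_s: list[int]) -> str:
    """Build a hypothetical 32-bit disc fingerprint from track lengths in seconds.

    The fingerprint depends on both the lengths and the order of the tracks,
    because it is computed from each track's starting offset.
    """
    offsets = []
    start = 0
    for length in track_lengths_s:
        offsets.append(start)  # offset at which this track begins
        start += length
    total_length = start

    checksum = sum(digit_sum(offset) for offset in offsets) % 255
    disc_id = (checksum << 24) | (total_length << 8) | len(track_lengths_s)
    return f"{disc_id:08x}"


# A stand-in for the online database: fingerprint -> album title (made up).
catalog = {disc_fingerprint([180, 240, 200]): "Example Album"}

same_disc = disc_fingerprint([180, 240, 200])
reordered = disc_fingerprint([240, 180, 200])
title = catalog.get(same_disc, "unknown disc")
```

The same track lengths in a different order produce a different fingerprint, which is why the sequence of tracks matters as much as their lengths.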

Many of the search techniques developed for web pages rely on the rich implied semantics of linking, in which every link is a vote, and votes from authoritative sources are ranked more highly than others. This is a kind of implicit user-contributed metadata that is not present when searching other types of content, such as digitized books. There, search remains in the same brute-force dark ages as web search before Google. We can expect significant breakthroughs in search techniques for books, video, images, and sound to be a feature of the future evolution of the Internet OS.
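To make the "every link is a vote" idea concrete, here is a minimal sketch of link-based ranking by power iteration, in the style popularized by PageRank. It is not the algorithm any search engine actually runs in production. The function name `rank_pages`, the damping factor of 0.85, the iteration count, and the toy link graph are all assumptions chosen for illustration.

```python
def rank_pages(links: dict[str, list[str]],
               damping: float = 0.85,
               iterations: int = 50) -> dict[str, float]:
    """Score pages so that a link from a highly ranked page counts for more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: "hub" is linked from two pages, "d" from none.
toy_links = {
    "a": ["hub"],
    "b": ["hub"],
    "hub": ["c"],
    "d": ["e"],
    "c": [],
    "e": [],
}
ranks = rank_pages(toy_links)
```

In this toy graph, "c" and "e" each receive exactly one link, but "c" ends up ranked higher because its link comes from the better-connected "hub". Digitized books lack this kind of link structure, so this approach has nothing to work with there.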

The techniques of algorithmic search are an essential part of the developer's toolkit today. The O'Reilly book Programming Collective Intelligence reviews many of the algorithms and techniques. But there's no question that this kind of low-level programming is ripe for a higher-level solution, in which developers just place a call to a search service, and return the results. Thus, search moves from application to system call.

Media Access

Just as a PC-era operating system has the capability to manage user-level constructs like files and directories as well as lower-level constructs like physical disk volumes and blocks, an Internet-era operating system must provide access to various types of media, such as web pages, music, videos, photos, e-books, office documents, presentations, downloadable applications, and more. Each of these media types requires some common technology infrastructure beyond specialized search:
  • Access Control. Since not all information is freely available, managing access control - providing snippets rather than full sources, providing streaming but not downloads, recognizing authorized users and giving them a different result from unauthorized users - is a crucial feature of the Internet OS. (Like it or not.)

    The recent moves by News Corp to place their newspapers behind a paywall, as well as the paid application and content marketplace of the iPhone and iPad, suggest that the ability to manage access to content is going to be more important, rather than less, in the years ahead. We're largely past the knee-jerk "keep it off the net" reactions of old school DRM; companies are going to be exploring more nuanced ways to control access to content, and the platform provider that has the most robust systems (and consumer expectations) for paid content is going to be in a very strong position.

    In the world of the App Store, paid applications and paid content are re-legitimizing access control (and payment.) Don't assume that advertising will continue to be the only significant way to monetize internet content in the years ahead.

  • Caching. Large media files benefit from being closer to their destination. A whole class of companies exists to provide Content Delivery Networks; these may survive as independent companies, or these services may ultimately be rolled up into the leading Internet OS companies in much the way that Microsoft acquired or "embraced and extended" various technologies on the way to making Windows the dominant OS of the PC era.
  • Instrumentation and analytics. Because of the amount of money at stake, an entire industry has grown up around web analytics and search engine optimization. We can expect a similar wave of companies instrumenting social media and mobile applications, as well as particular media types. After all, a video, a game, or an ebook can know how long you watch, when you abandon the product and where you go next.

    Expect these features to be pushed first by independent companies, like TweetStats or Peoplebrowsr Analytics for Twitter, or Flurry for mobile apps. GoodData, a cloud-based business intelligence platform, is being used for analytics on everything from Salesforce applications to online games. (Disclosure: I am an investor and on the board of GoodData.) But eventually, via acquisition or imitation, they will become part of the major platforms.

Communications

The internet is a communications network, and it's easy to forget that communications technologies like email and chat have long been central to the Internet's appeal. Now, with the widespread availability of VoIP, and with the mobile phone joining the "network of networks," voice and video communications are an increasingly important part of the communications subsystem.

Communications providers from the Internet world are now on a collision course with communications providers from the telephony world. For now, there are uneasy alliances right and left. But it isn't going to be pretty once the battle for control comes out into the open.

I expect the communications directory service to be one of the key battlefronts. Who will manage the lookup service that allows individuals and businesses to find and connect to each other? The phone and email address books will eventually merge with the data from social networks to provide a rich set of identity infrastructure services.

Identity and the Social Graph

When you use Facebook Connect to log into another application, and suddenly your friends' faces are listed in the new application, that application is using Facebook as a "subsystem" of the new Internet OS. On Android phones, simply add the Facebook application, and your phone address book shows the photos of your Facebook friends. Facebook is expanding the range of data revealed by Facebook Connect; they clearly understand the potential of Facebook as a platform for more than hosted applications.

But as hinted at above, there are other rich sources of social data - and I'm not just talking about applications like Twitter that include explicit social graphs. Every communications provider owns a treasure trove of social data. Microsoft has piles of social data locked up in Exchange, Outlook, Hotmail, Active Directory, and Sharepoint. Google has social data not just from Orkut (an also-ran in the US) but from Gmail and Google Docs, whose "sharing" is another name for "meaningful source of workgroup-level social graph data." And of course, now, there's the social graph data produced by the address book on every Android phone...

The breakthroughs that we need to look forward to may not come from explicitly social applications. In fact, I see "me too" social networking applications from those who have other sources of identity data as a sign that they don't really understand the platform opportunity. Building a social network to rival Facebook or Twitter is far less important to the future of the Internet platform than creating facilities that will allow third-party developers to leverage the social data that companies like Google, Microsoft, Yahoo!, AOL - and phone companies like AT&T, Verizon and T-Mobile - have produced through years or even decades of managing users' social data for communications.

Of course, use of this data will require breakthroughs in privacy mechanism and policy. As Nat Torkington wrote in email after reviewing an earlier draft of this post:

We still face the problem of "friend": my Docs social graph is different from my email social graph is different from my Facebook social graph is different from my address book. I want to be able to complain about work to my friends without my coworkers seeing it, and the usability-vs-privacy problem remains unsolved.

Whoever cracks this code, providing frameworks that make it possible for applications to be functionally social without being socially promiscuous, will win. Platform providers are in a good position to solve this problem once, so that users don't have to give credentials to a larger and larger pool of application providers, with little assurance that the data they provide won't be misused.

Payment

Payment is another key subsystem of the Internet Operating System. Companies like Apple that have 150 million credit cards on file and a huge population of users accustomed to using their phones to buy songs, videos, applications, and now ebooks, are going to be in a prime position to turn today's phone into tomorrow's wallet. (And as anyone who reaches into a wallet not for payment but for ID knows, payment systems are also powerful, authenticated identity stores - a fact that won't always be lost on payment providers looking for their lock on a piece of the Internet future.)

PayPal obviously plays an important role as an internet payment subsystem that's already in wide use by developers. It operates in 190 countries, in 19 different currencies (not counting in-game micro-currencies) and it has over 185 million accounts. What's fascinating is the rich developer ecosystem they've built around payment - their recent developer conference had over 2000 attendees. Their challenge is to make the transition from the web to mobile.

Google Checkout has been a distant also-ran in web payments, but the Android Market has given it new prominence in mobile, and will eventually make it a first class internet payment subsystem.

Amazon too has a credible payment offering, though until recently they haven't deployed it to full effect, reserving the best features for their own e-commerce site and not making them available to developers. (More on that in next week's post, in which I will handicap the leading platform offerings from major internet vendors.)

Advertising

Advertising has been the most successful business model on the web. While there are signs that e-commerce - buying everything from virtual goods to a lunchtime burrito - may be the bigger opportunity in mobile (and perhaps even in social media), there's no question that advertising will play a significant role.

Google's dominance of search advertising has involved better algorithmic placement, as well as the ability to predict, in real time, how often an ad will be clicked on, allowing them to optimize the advertising yield. The Google Ad Auction system is the heart of their economic value proposition, and demonstrates just how much difference a technical edge can make.

And advertising has always been a platform play. Signs that it will be a key battleground of the Internet OS can be seen in the competing acquisition of AdMob by Google and Quattro Wireless by Apple.

The question is the extent to which platform companies will use their advertising capabilities as a system service. Will they treat these assets as the source of competitive advantage for their own products, or will they find ways to deploy advertising as a business model for developers on their platform?

Location

Location is the sine qua non of mobile apps. When your phone knows where you are, it can find your friends, find services nearby, and even better authenticate a transaction.

Maps and directions on the phone are intrinsically cloud services - unlike with dedicated GPS devices, there's not enough local storage to keep all the relevant maps on hand. But when turned into a cloud application, maps and directions can include other data, such as real-time traffic (indeed, traffic data collected from the very applications that are requesting traffic updates - a classic example of "collective intelligence" at work.)

Location is also the search key for countless database lookup services, from Google's "search along route" to a Yelp search for nearby cafes to the Chipotle app routing your lunch request to the restaurant near you.

In many ways, Location is the Internet data subsystem that is furthest along in its development as a system service accessible to all applications, with developers showing enormous creativity in using it in areas from augmented reality to advertising. (Understanding that this would be the case, I launched the Where 2.0 Conference in 2005. There are lessons to be learned in the location market for all Internet entrepreneurs, not just "geo" geeks, as techniques developed here will soon be applied in many other areas.)

Activity Streams

Location is also becoming a proxy for something else: attention. The My location is a box of cereal." (Disclosure: O'Reilly AlphaTech Ventures is an investor in Foursquare.)

We thus see convergence between Location and social media concepts like Activity Streams. Platform providers that understand and exploit this intersection will be in a stronger position than those who see location only in traditional terms.

Time

Time is an important dimension of data-driven services - at least as important as location, though as yet less fully exploited. Calendars are one obvious application, but activity streams are also organized as timelines; stock charts link up news stories with spikes or drops in price. Time stamps can also be used as a filter for other data types (as Google measures frequency of update in calculating search results, or as an RSS feed or social activity stream organizes posts by recency).

"Real time" - as in the real-time search provided by Twitter, the "where am I now" pointer on a map, the automated replenishment of inventory at WalMart, or instant political polling - emphasizes just how much the future will belong to those who measure response time in milliseconds, or even microseconds, rather than seconds, hours, or days. This need for speed is going to be a major driver of platform services; individual applications will have difficulty keeping up.

Image and Speech Recognition

As I've written previously, one of the big differences since I first wrote Web Squared).

With the advent of smartphone apps like Google Goggles and the Amazon e-commerce app, which deploy advanced image recognition to scan bar codes, book covers, album covers and more - not to mention gaming platforms like Microsoft's still unreleased Project Natal and innovative startups like Affective Interfaces - it's clear that computer vision is going to be an important part of the UI toolkit for future developers. There are good computer vision packages like OpenCV that can be deployed locally for robotics applications, as well as for research projects like those competing in the DARPA Grand Challenge for automated vehicles, but for smartphone applications, image recognition, like speech recognition, happens in the cloud. Not only is there a wealth of compute cycles, there are also vast databases of images for matching purposes. Picasa and Flickr are no longer just consumer image sharing sites: they are vast repositories of tagged image data that can be used to train algorithms and filter results.

Government Data

Long before recent initiatives like FixMyStreet and SeeClickFix submit 311 reports to local governments - potholes that need filling, graffiti that needs repainting, streetlights that are out. These applications have typically overloaded existing communications channels like email and SMS, but there are now attempts to standardize an Open311 web services protocol.

Now, a new flood of government data is being released, and the government is starting to see itself as a platform provider, providing facilities for private sector third parties to build applications. This idea of Government as a Platform is a key focus of my advocacy about Government 2.0.

There is a huge opportunity to apply the lessons of Web 2.0 to government data. Take health care as an example. How might we improve our healthcare system if Medicare provided a feedback loop about costs and outcomes analogous to the one that Google built for search keyword advertising?

Anyone building internet data applications would be foolish to underestimate the role that government is going to play in this unfolding story, both as provider and consumer of data web services, and also as regulator in key areas like privacy, access, and interstate commerce.

What About the Browser?

While I think that claims that the browser itself is the new operating system are as misguided as the idea that it can be found solely in cloud infrastructure services, it is important to recognize that control over front end interfaces is at least as important as back-end services. Companies like Apple and Google that have substantial cloud services and a credible mobile platform play are in the catbird seat in the platform wars of the next decade. But the browser, and with it control of the PC user experience, is also critical.

This is why Apple's iPad, Google's ChromeOS, and HTML 5 (plus initiatives like Google's Native Client) are so important. Microsoft isn't far wrong in its cloud computing vision of "Software Plus Services." The full operating system stack includes back end infrastructure, the data subsystems highlighted in this article, and rich front-ends.

Apple and Microsoft largely have visions of vertically integrated systems; Google's vision seems to be for open source driving front end interfaces, while back end services are owned by Google. But in each case, there's a major drive to own a front-end experience that favors each company's back-end systems.

What's Still Missing

Even the most advanced Internet Operating System platforms are still missing many concepts that are familiar to those who work with traditional single-computer operating systems. Where is the executive? Where is the memory management?

I believe that these functions are evolving at each of the cloud platforms. Tools like memcache or mapreduce are the rough cloud equivalents of virtual memory or multiprocessing features in a traditional operating system. But they are only the beginning. Werner Vogels' post Eventually Consistent highlights some of the hard technical issues that will need to be solved for an internet-scale operating system. There are many more.

But it's also clear that there are many opportunities to build higher level functionality that will be required for a true Internet Operating System.

Might an operating system of the future manage when and how data is collected about individuals, what applications can access it, and how they might use it? Might it not automatically synchronize data between devices and applications? Might it do automatic translation, and automatic format conversion between different media types? Might such an operating system do predictive analytics to collect or locally cache data that it expects an individual user or device to need? Might such an operating system do "garbage collection" not of memory pointers but of outdated data or spam? Might it not perform credit checks before issuing payments and suspend activity for those who violate terms of service?

There is a great opportunity for developers with vision to build forward-looking platforms that aim squarely at our connected future, that provide applications running on any device with access to rich new sources of intelligence and capability. The possibilities are endless. There will be many failed experiments, many successes that will be widely copied, a lot of mergers and acquisitions, and fierce competition between companies with different strengths and weaknesses.

Next week, I'll handicap the leading players and tell you what I think of their respective strategies.

March 03 2010

1 in 4 Facebook Users Come From Asia or the Middle East

Asia's share of the more than 400 million active Facebook users recently surged past 15%:




With a market penetration of 1.7% in Asia and Africa, the company has barely scratched the surface in both regions. While the company continued to add users in Southeast Asia, there were an additional 2.3 million users from South Asia over the past 12 weeks. In fact according to Alexa, Facebook has already overtaken Orkut in India. It didn't take long for Facebook to threaten Friendster's leadership position in Southeast Asia so something similar was likely to happen in India. But I thought it would take them longer to overtake Orkut in India.

The share of users from the Middle East / North Africa remains stable (at just over 8%) and the region had the second fastest-growth rate over the past 12 weeks:


As was the case in my previous post, the share of users age 18-25 remains higher in regions outside the U.S., especially in Asia, the Middle East / North Africa, Africa, and South America. [For recent growth rates by age group, click HERE.]


While Asia and the Middle East are the fastest-growth regions, Facebook continues to add users everywhere. Eastern Europe continued to be fertile territory, with the company close to doubling its active members in Romania (up 86% over the last 12 weeks). Below is a list of fastest-growth countries in each region:


(†) Speaking of Orkut, for what it's worth, Facebook added 800,000 active users in Brazil over the past 12 weeks.


January 07 2010

Pew Research asks questions about the Internet in 2020

Pew Research, which seems to be interested in just about everything,
conducts a "future of the Internet" survey every few years in which
they throw outrageously open-ended and provocative questions at a
chosen collection of observers in the areas of technology and
society. Pew makes participation fun by finding questions so pointed
that they make you choke a bit. You start by wondering, "Could I
actually answer that?" and then think, "Hey, the whole concept is so
absurd that I could say anything without repercussions!" So I
participated in their 2006 survey
(http://www.pewinternet.org/Reports/2006/The-Future-of-the-Internet-II.aspx)
and did it again this week. The Pew report will
aggregate the yes/no responses from the people they asked to
participate, but I took the exercise as a chance to hammer home my own
choices of issues.

(If you'd like to take the survey, you can currently visit

http://www.facebook.com/l/c6596;survey.confirmit.com/wix2/p1075078513.aspx

and enter PIN 2000.)

Will Google make us stupid?

This first question is not about a technical or policy issue on the
Internet or even how people use the Internet, but a purported risk to
human intelligence and methods of inquiry. Usually, questions about
how technology affects our learning or practice really concern our
values and how we choose technologies, not the technology itself. And
that's the basis on which I address such questions. I am not saying
technology is neutral, but that it is created, adopted, and developed
over time in a dialog with people's desires.

I respect the questions posed by Nicholas Carr in his Atlantic
article--although it's hard to take such worries seriously when he
suggests that even the typewriter could impoverish writing--and would
like to allay his concerns. The question is all about people's
choices. If we value introspection as a road to insight, if we
believe that long experience with issues contributes to good judgment
on those issues, if we (in short) want knowledge that search engines
don't give us, we'll maintain our depth of thinking and Google will
only enhance it.

There is a trend, of course, toward instant analysis and knee-jerk
responses to events that degrades a lot of writing and discussion. We
can't blame search engines for that. The urge to scoop our contacts
intersects with the starvation of funds for investigative journalism
to reduce the value of the reports we receive about things that are
important for us. Google is not responsible for that either (unless
you blame it for draining advertising revenue from newspapers and
magazines, which I don't). In any case, social and business trends
like these are the immediate influences on our ability to process
information, and searching has nothing to do with them.

What search engines do is provide more information, which we can use
either to become dilettantes (Carr's worry) or to bolster our
knowledge around the edges and do fact-checking while we rely mostly
on information we've gained in more robust ways for our core analyses.
Google frees the time we used to spend pulling together the last 10%
of facts we need to complete our research. I read Carr's article when
The Atlantic first published it, but I used a web search to pull it
back up and review it before writing this response. Google is my
friend.

Will we live in the cloud or the desktop?

Our computer usage will certainly move more and more to an environment
of small devices (probably in our hands rather than on our desks)
communicating with large data sets and applications in the cloud.
This dual trend, bifurcating our computer resources between the tiny
and the truly gargantuan, has many consequences that other people
have explored in depth: privacy concerns, the risk that application
providers will gather enough data to preclude competition, the
consequent slowdown in innovation that could result, questions about
data quality, worries about services becoming unavailable (like
Twitter's fail whale, which I saw as recently as this morning), and
more.

One worry I have is that netbooks, tablets, and cell phones will
become so dominant that meaty desktop systems will rise in cost
until they are within reach only of institutions and professionals.
That will discourage innovation by the wider populace and reduce us to
software consumers. Innovation has benefited a great deal from the
ability of ordinary computer users to bulk up their computers with a
lot of software and interact with it at high speeds using high quality
keyboards and large monitors. That kind of grassroots innovation may
go away along with the systems that provide those generous resources.

So I suggest that cloud application providers recognize the value of
grassroots innovation--following Eric von Hippel's findings--and
solicit changes in their services from their visitors. Make their code
open source--but even more than that, set up test environments where
visitors can hack on the code without having to download much
software. Then anyone with a comfortable keyboard can become part of
the development team.

We'll know that software services are on a firm foundation for future
success when each one offers a "Develop and share your plugin here"
link.

Will social relations get better?

Like the question about Google, this one is more about our choices
than our technology. I don't worry about people losing touch with
friends and family. I think we'll continue to honor the human needs
that have been hard-wired into us over the millions of years of
evolution. I do think technologies ranging from email to social
networks can help us make new friends and collaborate over long
distances.

I do worry, though, that social norms aren't keeping up with
technology. For instance, it's hard to turn down a "friend" request
on a social network, particularly from someone you know, and even
harder to "unfriend" someone. We've got to learn that these things are
OK to do. And we have to be able to partition our groups of contacts
as we do in real life (work, church, etc.). More sophisticated social
networks will probably evolve to reflect our real relationships more
closely, but people have to take the lead and refuse to let technical
options determine how they conduct their relationships.

Will the state of reading and writing be improved?

Our idea of writing changes over time. The Middle Ages left us lots of
horribly written documents. The few people who learned to read and
write often learned their Latin (or other language for writing) rather
minimally. It took a long time for academies to impose canonical
rules for rhetoric on the population. I doubt that a cover letter and
resume from Shakespeare would meet the writing standards of a human
resources department; he lived in an age before standardization and
followed his ear more than rules.

So I can't talk about "improving" reading and writing without
addressing the question of norms. I'll write a bit about formalities
and then about the more important question of whether we'll be able to
communicate with each other (and enjoy what we read).

In many cultures, writing and speech have diverged so greatly that
they're almost separate languages. And English in Jamaica is very
different from English in the US, although I imagine Jamaicans try
hard to speak and write in US style when they're communicating with
us. In other words, people do recognize norms, but usage depends on
the context.

Increasingly, nowadays, the context for writing is a very short-form
utterance, with constant interaction. I worry that people will lose
the ability to state a thesis in unambiguous terms and a clear logical
progression. But because they'll be in instantaneous contact with
their audience, they can restate their ideas as needed until
ambiguities are cleared up and their reasoning is unveiled. And
they'll be learning from others along the way. Making an elegant and
persuasive initial statement won't be so important because that
statement will be only the first step of many.

Let's admit that dialog is emerging as our generation's way to develop
and share knowledge. The notion driving Ibsen's Hedda Gabler--that an
independent philosopher such as Ejlert Løvborg could write a
masterpiece that would in itself change the world--is passé. A
modern Løvborg would release his insights in a series of blogs
to which others would make thoughtful replies. If this eviscerated
Løvborg's originality and prevented him from reaching the
heights of inspiration--well, that would be Løvborg's fault for
giving in to pressure from more conventional thinkers.

If the Romantic ideal of the solitary genius is fading, what model for
information exchange do we have? Check Plato's Symposium. Thinkers
were expected to engage with each other (and to have fun while doing
so). Socrates denigrated reading, because one could not interrogate
the author. To him, dialog was more fertile and more conducive to
truth.

The ancient Jewish scholars also preferred debate to reading. They
certainly had some received texts, but the vast majority of their
teachings were generated through conversation and were not written
down at all until the scholars realized they had to in order to avoid
losing them.

So as far as formal writing goes, I do believe we'll lose the subtle
inflections and wordplay that come from a widespread knowledge of
formal rules. I don't know how many people nowadays can appreciate all
the ways Dickens sculpted language, for instance, but I think there
will be fewer in the future than there were when Dickens rolled out
his novels.

But let's not get stuck on the aesthetics of any one period. Dickens
drew on a writing style that was popular in his day. In the next
century, Toni Morrison, John Updike, and Vladimir Nabokov wrote in a
much less formal manner, but each is considered a beautiful stylist in
his or her own way. Human inventiveness is infinite and language is a
core skill in which we all take pleasure, so we'll find new ways to
play with language that are appropriate to our age.

I believe there will always remain standards for grammar and
expression that will prove valuable in certain contexts, and people
who take the trouble to learn and practice those standards. As an
editor, I encounter lots of authors with wonderful insights and
delightful turns of phrase, but with deficits in vocabulary, grammar,
and other skills and resources that would enable them to write better.
I work with these authors to bring them up to industry-recognized
standards.

Will those in GenY share as much information about themselves as they age?

I really can't offer anything but baseless speculation in answer to
this question, but my guess is that people will continue to share as
much as they do now. After all, once they've put so much about
themselves up on their sites, what good would it do to stop? In for a
penny, in for a pound.

Social norms will evolve to accept more candor. After all, Ronald
Reagan got elected President despite having gone through a divorce,
and Bill Clinton got elected despite having smoked marijuana.
Society's expectations evolve.

Will our relationship to key institutions change?

I'm sure the survey designers picked this question knowing that its
breadth makes it hard to answer, but in consequence it's something of
a joy to explore.

The widespread sharing of information and ideas will definitely change
the relative power relationships of institutions and the masses, but
they could move in two very different directions.

In one scenario offered by many commentators, the ease of
whistleblowing and of promulgating news about institutions will
combine with the ability of individuals to associate over social
networking to create movements for change that hold institutions more
accountable and make them more responsive to the public.

In the other scenario, large institutions exploit high-speed
communications and large data stores to enforce even greater
centralized control, and use surveillance to crush opposition.

I don't know which way things will go. Experts continually urge
governments and businesses to open up and accept public input, and
those institutions resist doing so despite all the benefits. So I have
to admit that in this area I tend toward pessimism.

Will online anonymity still be prevalent?

Yes, I believe people have many reasons to participate in groups and
look for information without revealing who they are. Luckily, most new
systems (such as U.S. government forums) are evolving in ways that
build in privacy and anonymity. Businesses are more eager to attach
our online behavior to our identities for marketing purposes, but
perhaps we can find a compromise where someone can maintain a
pseudonym associated with marketing information but not have it
attached to his or her person.

Unfortunately, most people don't appreciate the dangers of being
identified. But those who do can take steps to be anonymous or
pseudonymous. As for state repression, there is something of an
escalating war between individuals doing illegal things and
institutions who want to uncover those individuals. So far, anonymity
seems to be holding on, thanks to a lot of effort by those who care.

Will the Semantic Web have an impact?

As organizations and news sites put more and more information online,
they're learning the value of organizing and cross-linking
information. I think the Semantic Web is taking off in a small way on
site after site: a better breakdown of terms on one medical site, a
taxonomy on a Drupal-powered blog, etc.

But Berners-Lee had a much grander vision of the Semantic Web than
better information retrieval on individual sites. He's gunning for
content providers and Web designers the world around to pull together
and provide easy navigation from one site to another, despite wide
differences in their contributors, topics, styles, and viewpoints.

This may happen someday, just as artificial intelligence is looking
more feasible than it was ten years ago, but the chasm between the
present and the future is enormous. To make the big vision work, we'll
all have to use the same (or overlapping) ontologies, with standards
for extending and varying the ontologies. We'll need to disambiguate
things like webbed feet from the World Wide Web. I'm sure tools to
help us do this will get smarter, but they need to get a whole lot
smarter.

Even with tools and protocols in place, it will be hard to get
billions of web sites to join the project. Here the cloud may be of
help. If Google can perform the statistical analysis and create the
relevant links, I don't have to do it on my own site. But I bet
results would be much better if I had input.

Are the next takeoff technologies evident now?

Yes, I don't believe there's much doubt about the technologies that
companies will commercialize and make widespread over the next five
years. Many people have listed these technologies: more powerful
mobile devices, ever-cheaper netbooks, virtualization and cloud
computing, reputation systems for social networking and group
collaboration, sensors and other small systems reporting limited
amounts of information, do-it-yourself embedded systems, robots,
sophisticated algorithms for slurping up data and performing
statistical analysis, visualization tools to report the results of
that analysis, affective technologies, personalized and location-aware
services, excellent facial and voice recognition, electronic paper,
anomaly-based security monitoring, self-healing systems--that's a
reasonable list to get started with.

Beyond five years, everything is wide open. One thing I'd like to see
is a really good visual programming language, or something along those
lines that is more closely matched to human strengths than our current
languages. An easy high-level programming language would immensely
increase productivity, reduce errors (and security flaws), and bring
in more people to create a better Internet.

Will the internet still be dominated by the end-to-end principle?

I'll pick up here on the paragraph in my answer about takeoff
technologies. The end-to-end principle is central to the Internet. I
think everybody would like to change some things about the current
essential Internet protocols, but they don't agree what those things
should be. So I have no expectation of a top-to-bottom redesign of the
Internet at any point in our viewfinder. Furthermore, the inertia
created by millions of systems running current protocols would be hard
to overcome. So the end-to-end principle is enshrined for the
foreseeable future.

Mobile firms and ISPs may put up barriers, but anyone in an area of
modern technology who tries to shut the spigot on outside
contributions eventually becomes last year's big splash. So unless
there's a coordinated assault by central institutions like
governments, the inertia of current systems will combine with the
momentum of innovation and public demand for new services to keep
chokepoints from being serious problems.

December 25 2009

Four short links: 25 December 2009

  1. One Billionth Spam Message Stats -- from the honeypot project comes a pile of stats about which countries spam, what they spam for, when they spam, etc. "One intriguing insight our data provides is that bad guys take vacations too. For example, there is a 21% decrease in spam on Christmas Day and a 32% decrease on New Year's Day. Monday is the biggest day of the week for spam, while Saturday receives only about 60% of the volume of Monday's messages." Enjoy your day off spam. (via Bruce Schneier)
  2. Flowing Data's Five Best Data Visualization Projects of 2009 -- I think I listed at least four of these in this year's Four Short Links. You're welcome!
  3. Six Degrees of Separation -- tiring of "Sound of Music"? This BBC documentary on the science of social connection may help.
  4. Nanoscale Snowmen -- "The snowman is 10 µm across, 1/5th the width of a human hair." (via BoingBoing)

December 15 2009

Is Facebook a Brand that You Can Trust?

Isn't it about time that we started holding our online brands to the same standards that we hold our offline ones?

Case in point, consider Facebook. In Facebook's relatively short life, there has been the Beacon Debacle (a 'social' advertising model that only Big Brother could love), the Scamville Furor (lead gen scams around social gaming) and now, the Privacy Putsch.

By Privacy Putsch, I am referring to Facebook's new 'Privacy' Settings, which unilaterally invoked upon all Facebook users a radically different set of privacy setting defaults than had been in place during the company's build-up to its current 350 million strong user base.

To put a bow around this one, the EFF (Electronic Frontier Foundation), not exactly a bastion of radicalism, concluded after comparing Facebook's new privacy settings with the privacy settings that they replaced:

"Our conclusion? These new 'privacy' changes are clearly intended to push Facebook users to publicly share even more information than before. Even worse, the changes will actually reduce the amount of control that users have over some of their personal data." EFF adds that, "The privacy 'transition tool' that guides users through the configuration will 'recommend' — preselect by default — the setting to share the content they post to Facebook, such as status messages and wall posts, with everyone on the Internet, even though the default privacy level that those users had accepted previously was limited to 'Your Networks and Friends' on Facebook."

Ruminate on what that means for a moment. You are a parent, and you regularly upload photos of your kids to Facebook, blithely assuming that they are free from the roaming eyes of some sexual predator. While previously, these photos were only viewable to the Friends and Networks that you explicitly connected with, now, without consulting you, Facebook has made your son or daughter's pictures readily accessible to friend or felon.

Or, perhaps you are a typical 'thirty something,' sharing your weekend escapades with what you thought was a bounded social circle. Now, your current or prospective employer is just a click away from concluding that, perhaps trusting the company's marketing department to you is not such a good idea after all.

So as not to split hairs, let's just agree that some potential existed for either of these scenarios to have occurred under the old privacy model. It's also worth noting that if you actually understand what these new settings mean to your world, you can reverse (many of) these settings.

But, that's beside the point. Why? Because three separate instances now (i.e., Beacon, Scamville and Privacy Settings) have underscored a tendency of Facebook to not only make fairly key strategic decisions without first engaging its user base in a bilateral dialog, but to make decisions that are decidedly at odds with consumer protection/interest.

On a human level, one can look at the new privacy changes as akin to going to sleep at night with the assumption that the various doors and windows of your house were locked, only to wake up and realize that while you were sleeping, the 'locksmith' decided that you/they were better served if the doors were left unlocked.

Upon waking up to discover this unilateral decision, would you be pissed? Would you trust the locksmith to keep you safe at night going forward?

One last example before I move on, here's another excerpt from EFF's analysis on the 'Good, Bad and Ugly' of the new privacy settings:

The Ugly: Information That You Used to Control Is Now Treated as "Publicly Available," and You Can't Opt Out of The "Sharing" of Your Information with Facebook Apps.

Specifically, under the new model, Facebook treats information, such as friends lists, your name, profile picture, current city, gender, networks, and the pages that you are a 'fan' of — as 'publicly available information,' a new definition of heretofore personal information that Facebook held off disclosing in any material way -- until the very day it was forcing the new change on users.

Blogger Jason Calacanis puts this policy in perspective in his excellent post, 'Is Facebook unethical, clueless or unlucky?'

I'm sorry, what the frack just happened? I turned over my friend list, photos and status updates to everyone in the world? Why on earth would anyone do that with their Facebook page? The entire purpose of Facebook since inception has been to share your information with a small group of people in your private network. Everyone knows that and everyone expects that. In fact, Facebook's success is largely based on the fact that people feel safe putting their private information on Facebook.

Do with this information what you will (forewarned is forearmed, after all), but me personally, after reviewing each Facebook photo album of mine with personal, family and/or friend oriented photos within it, I couldn't help but feel that Facebook should be given a new name: Faceless Betrayal.

Some Relativity from the World of Offline Brands: Perrier and Tylenol

I read an interesting stat in Fast Company about the US Bottled Water Industry. Americans now spend more on bottled water than they do on iPods or Movie Tickets - $16B.

Now, think back to 1989. Perrier Water was the imported water market leader in North America, with an eponymous water product marketed as 'naturally sparkling' water sourced from a mineral spring in the south of France.

But then, Perrier ran into serious trouble when the noxious, cancer-causing agent, Benzene, was found in the water that Perrier sold in the United States.

Seeking damage control, the company wavered between silence and evasiveness, with Perrier initially stating that the problem was an isolated one, when in actuality, it had turned out to be a global issue.

Perrier's ultimate mistake, though, was responding to a serious brand integrity crisis in a less than above-board, consultative fashion with its customer base.

The net effect is that, despite a massive global boom in bottled water consumption, a once-trusted, dominant brand, in essence, collapsed. In the end, Perrier's sales fell by half; the company was later sold, and the brand never recovered.

By contrast, when seven people died after taking cyanide-laced Extra-Strength Tylenol capsules, the company did a massive education and outreach effort, culminating with the recall of 31 million bottles of the product, at a then-cost of $100M. Like Perrier, Tylenol's market share initially cratered.

But, because the company had been proactive, public and always acting in the best interests of its consumers, within a year, its share had rebounded dramatically, and within a few years, had come all the way back.

The moral of the story is that two companies faced crises that threatened to kneecap their brand, but only one maintained a consistent focus on living up to the trust that its customers had put in the brand. Tellingly, the market rewarded the brand that was truest to its customers (Tylenol).

Netting it out: In light of the company's past consumer-unfriendly initiatives, Facebook's privacy settings change should serve as a wake up call to its 350M users that they are entrusting a Fox to guard the Hen House; a truth that is destined to erupt into a crisis for the company. Will they handle it like Tylenol or Perrier?

Related Post:
Why Facebook's Terms of Service Change is Much Ado About Nothing

November 23 2009

More that sociologist Erving Goffman could tell us about social networking and Internet identity

After posting some thoughts a month ago about Erving Goffman's classic
sociological text, The Presentation of Self in Everyday Life, I heard
from a reader who urged me to try out a deeper work of Goffman's, Frame
Analysis (Harper Colophon, 1974). This blog presents the thoughts that
came to mind as I made my way through that long and rambling work.

Let me start by shining a light on an odd phenomenon we've all
experienced online. Lots of people on mailing lists, forums, and
social networks react with great alarm when they witness heated
arguments. This reaction, in my opinion, stems from an ingrained
defense mechanism whose intensity verges on the physiological. We've
all learned, from our first forays to the playground as children, that
rough words easily escalate to blows. So we react to these words in
ways to protect ourselves and others.

Rationally, this defense mechanism wouldn't justify intervening in an
online argument. The people arguing could well be on separate
continents, and have close to zero chance of approaching each other
for battle before they cool down.

When asked why forum participants insert themselves between the
fighters--just as they would in a real-life brawl--they usually say,
"It's because I'm afraid of allowing a precedent to be set on this
forum; I might be attacked the same way." But this still raises the
question of what's wrong with an online argument. No forum member is
likely to be victim of violence.

We can apply Goffman's frame analysis to explain the forum members'
distress. It's what he calls a keying: we automatically apply
the lessons of real-life experiences to artificial ones. Keying allows
us to invest artificial circumstances--plays, ceremonies, court
appearances, you name it--with added meaning.

Human beings instinctively apply keyings. When we see a movie
character enter a victim's home carrying a gun, we forget we're
watching a performance and feel some of the same tightness in our
chest that we would feel had it been ourselves someone was stalking.

Naturally, any person of normal mental capacity can recognize the
difference between reality and an artificial re-enactment. We suspend
disbelief when we watch a play, reacting emotionally to the actors as
if they were real people going about their lives, but we don't
intervene when one tries to run another through with a knife, as we
would (one hopes) in real life.

Why do some people jump eagerly into online disputes, while others
plead with them to hold back? This is because, I think, disputes are
framed by different participants in different ways. Yes, some people
attack others in the hope of driving them entirely off the list; their
words are truly aimed at crushing the other. But many people just see
a healthy exchange of views where others see acts of dangerous
aggression. Goffman even had a term for the urge to flee taken up by
some people when they find that actions go too far: flooding out.

I should meekly acknowledge here that I play Nice Guy when I post to
online forums: I respect other people for their positions, seek common
ground, etc. I recognize that forums lose members when hotheads are
free to roam and toss verbal bombs, but I think forums may also lose a
dimension by suppressing the hotheads, who often have valid points and
a drive to aid the cause. One could instead announce a policy that
those who wish to flame can do so, and those who wish to ignore them
are also free to do so.

How much of Goffman's sprawling 575-page text applies online? Many
framing devices that he explored in real life simply don't exist on
digital networks. For instance, forums rarely have beginnings and
endings, which are central to framing for Goffman. People just log in
and start posting, experiencing whatever has been happening in the
meantime.

And as we've heard a million times, one can't use clothing, physical
appearance, facial expressions, and gestures to help evaluate online
text. Of course, we have graphics, audio, and video on the Internet
now as well, but they are often used for one-way consumption rather
than rapid interaction. A lot of online interaction is still carried
on in plain text. So authors toss in smileys such as :-) and other
emoticons. But these don't fill the expressiveness gap because they
must be explicitly included in text, and therefore just substitute for
things the author wanted to say in words. What helps make
face-to-face interactions richer than text interactions is the
constant stream of unconscious vocal and physical signals that we
(often unconsciously) monitor.

So I imagine that, if Goffman returned to add coverage of the Internet
to Frame Analysis, it would form a very short appendix
(although he could be insufferably long-winded). Still, his analyses
of daily life and of performances bring up interesting points that
apply online.

The online forums are so new that we approach them differently from
real-life situations. We have fewer expectations with which to frame
our interactions. We know that we can't base our assumptions on
framing circumstances, such as when we strike up a conversation with
someone we've just met by commenting on the weather or on a dinner
speaker we both heard.

Instead, we frame our interactions explicitly, automatically providing
more context. For instance, when we respond to email, we quote the
original emails in our response (sometimes excessively).

And we judge everybody differently because we know that they choose
what they say carefully. We fully expect the distorted appearances
described in the Boston Globe article My profile, myself,
subtitled "Why must I and everyone else on Facebook be so insufferably
happy?" We wouldn't expect to hear about someone's drug problem or
intestinal upset or sexless marriage on Facebook, any more than we'd
expect to hear it when we're sitting with them on a crowded beach.

Goffman points out that the presence of witnesses is a frame in
itself, changing any interaction between two people. This definitely
carries over online where people do more and more posting to their
friend's Facebook Wall (a stream of updates visible to all their other
friends) instead of engaging in private chats.

But while explaining our loss of traditional frames, I shouldn't leave
the impression that nothing takes their place. The online medium has
powerful frames all its own. Thus, each forum is a self-contained
unit. In real life we can break out of frames, such as when actors
leave the stage and mingle with audience members. This can't happen
within the rigidity of online technology.

It can be interesting to meet the same person on two different forums.
The sometimes subtle differences between forums affect their
presentation on each one. They may post the same message to different
forums, but that's often a poor practice that violates the frames on
one or more forums. So if they copy a posting, they usually precede it
with some framing text to make it appropriate for a particular forum.

Online forums also set up their own frames within themselves, and
these frames can be violated. Thus, a member may start a discussion
thread with the title "Site for new school," but it may quickly turn
into complaints about the mayor or arguments about noise in the
neighborhood. This breaks the frame, and people may go on for some
time posting all manner of comments under the "Site for new school"
heading until they are persuaded to start a new thread or take the
arguments elsewhere.

A frame, for Goffman, is an extremely broad concept (which I believe
weakens its value). Any assumption underlying an interaction can be
considered part of the frame. For instance, participants on forums
dedicated to social or technical interactions often ask whether it's
considered acceptable to post job listings or commercial offerings. In
other words, do forum participants impose a noncommercial mandate as
part of the frame?

A bit of history here can help newer Internet denizens understand
where this frame comes from. When the Internet began, everything was
run over wires owned by the federal government, the NSFNET Backbone
Network. All communication on the backbone was required to be
noncommercial, a regulation reinforced by the ivory-tower idealism of
many participants. Many years after private companies added new lines
and carried on their business over the Internet, some USENET forums
would react nastily to any posting with a hint of a commercial
message.

Although tedious--despite the amusing anecdotes--my read of Frame
Analysis was useful because I realized how much of our lives is
lived in context (that is, framed), and how adrift we are when we are
deprived of those frames in today's online environments--cognitively
we know we are deprived, but we don't fully accept its implications.
Conversely, I think that human beings crave context, community, and
references. So the moment we go online, we start to recreate those
things. Whether we're on a simple mailing list or a rich 3D virtual
reality site, we need to explicitly recreate context, community, and
references. It's worth looking for the tools to do so, wherever we
land online.

November 20 2009

Asia Continues to be Facebook's Strongest Growth Region

With Facebook topping 330 million active users over the past week, the company's strongest growth region continues to be Asia. Over the last 12 weeks, Facebook added close to 17M active users in Asia alone. Since my previous post, the share of active users from Asia grew by 2% (to 13.5% of all users), and roughly 1 in 7 users now come from the region. With a market penetration under 2%, Facebook is poised to add many more users in Asia (and Africa).
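For readers who want to check the "1 in 7" figure, here is a rough back-of-the-envelope sketch. It uses only the two numbers quoted above (330 million active users and a 13.5% share from Asia). The variable names are made up for illustration, and the derived user count is an estimate, not a figure reported by Facebook.

```python
# Back-of-the-envelope check of the figures quoted in the post.
# Inputs come from the text; everything derived below is an estimate.
total_active_users = 330_000_000   # "topping 330 million active users"
asia_share = 0.135                 # "13.5% of all users"

# Implied number of active users in Asia.
asia_users = total_active_users * asia_share

# "Roughly 1 in N users" is the reciprocal of the share.
one_in_n = 1 / asia_share

print(f"Implied Asian users: about {asia_users / 1e6:.1f} million")
print(f"Roughly 1 in {one_in_n:.1f} users come from Asia")
```

A 13.5% share works out to a little under one user in seven, which is consistent with the rounding used above.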


Compared to the U.S., the proportions of Facebook users in their teens (13-17) and in the 18-25 age group are much higher in Asia.


As was the case in other parts of the world, expect the share of users 45 and older to climb as Facebook becomes more mainstream in Asia. Growth was strong across all age groups in Asia over the last 12 weeks, particularly among teens (+90%) and the 18-25 age group (+60%).


In other regions, notably North America, Europe, the Middle East, and South America, growth in the 18-25 age bracket lagged behind growth among users 45 and older.


In closing I want to highlight countries (within several regions) where Facebook has been growing rapidly:

In Europe, growth has been fastest in the East: as an example, the number of active users in Poland doubled over the last 12 weeks. Growth in Southeast Asia remains strong in countries that have been home to Friendster's core user base. While Facebook added over 800,000 active users in Brazil, for now Orkut remains the dominant social network in South America's most populous country.
