
November 08 2011

When good feedback leaves a bad impression

When a teacher is prone to hyperbole — lots of "greats!" and "excellents!" and "A+++" grades — it's natural for a student to perceive a mere "good" as an undesirable response. According to Panagiotis Ipeirotis, associate professor at New York University, the same perception applies to online reviews.

In a recent interview, Ipeirotis touched on the negative impact of good-enough reviews and a host of other data-related topics. Highlights from the interview (below) included:

  • Sentiment analysis is a commonly used tool for measuring what people are saying about a particular company or brand, but it has issues. "The problem with sentiment analysis," said Ipeirotis, "is that it tends to be rather generic, and it's not customized to the context in which people read." Ipeirotis pointed to Amazon as a good example here, where customer feedback about a merchant that says "good packaging" might initially appear as positive sentiment, but "good" feedback can have a negative effect on sales. "People tend to exaggerate a lot on Amazon. 'Excellent seller.' 'Super-duper service.' 'Lightning-fast delivery.' So when someone says 'good packaging,' it's perceived as, 'that's all you've got?'" [Discussed at the 0:42 mark.]
  • Ipeirotis suggested that people should challenge the initial conclusions they make from data. "Every time that something seems to confirm your intuition too much, I think it's good to ask for feedback." [Discussed at 2:24.]
  • Ipeirotis has done considerable research on Amazon's Mechanical Turk (MTurk) platform. He described MTurk as "an interesting example of a market that started with the wrong design." Amazon thought that its cloud-based labor service would be "yet another of its cloud services." But a market that "involves people who are strategic and responding to incentives," said Ipeirotis, "is very different than a market for CPUs and so on." Because Amazon didn't take this into consideration early on, the service has faced spam and reputation issues. Ipeirotis pointed to the site's use of anonymity as an example: Anonymity was supposed to protect privacy, but it's actually hurt some of the people who are good at what they do because anonymity is often associated with spammers. [Discussed at 2:55.]
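Ipeirotis's point about context can be sketched in a few lines: a generic lexicon scores "good" as positive in isolation, but re-centering the score against a marketplace's inflated feedback baseline flips it negative. Everything here (the lexicon values, the 0.9 baseline) is hypothetical, purely for illustration:

```python
# Toy illustration of context-free vs. context-aware sentiment scoring.
# The lexicon and the marketplace baseline below are invented numbers.
GENERIC_LEXICON = {"excellent": 1.0, "super-duper": 1.0, "good": 0.6, "bad": -0.8}

def generic_score(text):
    """Naive lexicon lookup: average the scores of known words."""
    hits = [GENERIC_LEXICON[w] for w in text.lower().split() if w in GENERIC_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def contextual_score(text, baseline):
    """Re-center the generic score against the context's typical feedback."""
    return generic_score(text) - baseline

# On a marketplace where feedback averages ~0.9 ("Excellent seller!"),
# a mere "good" lands below the norm and reads as a complaint.
amazon_baseline = 0.9
print(generic_score("good packaging"))                      # positive in isolation
print(contextual_score("good packaging", amazon_baseline))  # negative in context
```

The point of the sketch is only that the same words can carry opposite signals once the surrounding distribution of feedback is taken into account.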

The full interview is available in the following video:

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

Some quotes from this interview were edited and condensed for clarity.


December 20 2010

Four short links: 20 December 2010

  1. Gawker Tech Team Didn't Adequately Secure Our Platform -- internal memo from CTO to staff after the break-in. Notable for two things: the preventative steps, which include things like two-factor authentication and not collecting commenter details; and the lack of defensiveness. When your executives taunt 4chan and your systems get pwned as a result, it must be mighty hard not to point the finger at those executives. I hope I can be as adult as Tom Plunkett when shit next happens to me. (via Andy Baio)
  2. Mechanical Turk Spam -- 40% of the HITs from new requesters are spam. The list of tasks is the online fraud hitlist: faking votes/comments/etc on social sites, making fake accounts, submitting fake leads through lead gen sites, fake clicks on ads, posting fake ads to Craigslist, requesting personal info of the MTurk worker. (via Andy Baio who is on fire)
  3. 2010 The Year Open Source Went Invisible (Matt Asay) -- All of which is a long way of saying that while open source has become integral to so much software development, it hasn't remotely ended the reign of proprietary software. Indeed, much (most?) open-source software is paid for out of proprietary profits. This might have been shocking news in, say, 2004, but it's common knowledge in 2010. Open source is how we do business 10 years into this new millennium.
  4. Quantitative Analysis of Culture Using Millions of Digitized Books (Science) -- We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. This is related to Google Labs' latest toy, the n-gram viewer whose correct name should be Google Pottymouth if the things people are graphing are anything to go by.

October 25 2010

Crowdsourcing specific microtasks

Since the first-ever Mechanical Turk meetup a year ago, there has been an explosion in crowdsourcing services and a well-attended conference in San Francisco. I remain enthusiastic about crowdsourcing, but the number of companies has me worried about quality of work. Fortunately specialization is already occurring, so for particular tasks there are companies out there ready to provide high-quality service.

One company that recently caught my eye is Helsinki (and SF) based Microtask. Founded by Computer Graphics (CG) and Computer Vision (CV) veterans, Microtask has chosen to focus on a few areas where CG and CV are relevant. Aside from speech transcription, they currently provide form-processing (digitizing hand-written forms) and archive digitization services, and have plans to expand into image categorization and video indexing in the near future. By initially focusing on a few specific tasks, Microtask is able to refine its platform while simultaneously leveraging prior skills in areas such as optical character recognition.


A few things about the Microtask platform are worth highlighting. In order to protect the intellectual property of its customers, Microtask never sends complex tasks to the same service provider. Rather, tasks are broken up into pieces and scattered across multiple providers. This is fairly easy to do for the types of digitization services they offer. Customers who are wary of sending data to outside servers can run Microtask's software on their own servers. (Ordinarily customers use Microtask through a set of APIs.) Finally, Microtask can guarantee quality and delivery time because it has long-term relationships with (labor) service providers. Microtask contracts out with call centers throughout the world, and tasks* are performed by workers in between service calls.
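The split-and-scatter idea is straightforward to sketch. This is a hypothetical illustration of the concept, not Microtask's actual API; the function, field, and provider names are all invented:

```python
import itertools

def scatter_tasks(document_fields, providers):
    """Split one customer job into per-field micro-tasks and assign them
    round-robin, so no single provider sees the whole document.
    (Hypothetical sketch of the split-and-scatter idea, not a real API.)"""
    assignments = {p: [] for p in providers}
    rr = itertools.cycle(providers)
    for field in document_fields:
        assignments[next(rr)].append(field)
    return assignments

form = ["name", "address", "phone", "signature-date"]
workers = ["provider-a", "provider-b", "provider-c"]
plan = scatter_tasks(form, workers)
for provider, fields in plan.items():
    # Each provider receives only a fragment of the form.
    print(provider, fields)
```

Because each provider sees only disconnected fragments, no one in the labor pool can reconstruct a complete customer document.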

Using call-center workers is novel, but crowdsourcing seems increasingly tied to social gaming and virtual currencies. Having come from Computer Graphics and Computer Vision, the founders of Microtask have experience and connections in the gaming industry. CEO Ville Miettinen admitted that social gaming integration is a high priority for them over the next few years. The key is that they want Microtask to fit seamlessly into the gaming experience: they want gamers to be able to stay "in the flow of the game" while performing crowdsourcing tasks. I'm looking forward to what they and game designers come up with -- a modern equivalent of Typing of the Dead?

(*) It's useful to remember that these are simple tasks (e.g., OCR) involving the validation of outputs generated using machine-learning. Microtask uses confidence scores generated by their algorithms to rank and prioritize validation tasks.
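That confidence-driven prioritization can be sketched minimally, assuming the OCR engine emits a per-field confidence score. The field names, scores, and the 0.95 threshold are invented for illustration:

```python
def validation_queue(ocr_results, threshold=0.95):
    """Route low-confidence OCR outputs to human validators first.
    Hypothetical sketch: 'confidence' is whatever score the OCR
    engine emits; fields at or above the threshold skip review."""
    needs_review = [r for r in ocr_results if r["confidence"] < threshold]
    # Least-confident first, so worker effort goes where the model is weakest.
    return sorted(needs_review, key=lambda r: r["confidence"])

results = [
    {"field": "name",  "text": "J0hn Smth", "confidence": 0.41},
    {"field": "city",  "text": "Helsinki",  "confidence": 0.99},
    {"field": "phone", "text": "555-0123",  "confidence": 0.88},
]
for task in validation_queue(results):
    print(task["field"], task["confidence"])
# "city" (0.99) is auto-accepted; the other fields go to workers.
```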

June 30 2010

Four short links: 30 June 2010

  1. Publishers Who Don't Know History ... (Cory Ondrejka) -- interesting thoughts on publishing. Friends share, borrow, and recommend books. Currently, publishers are generally being stupid about this.
  2. Regulating Distributed Work -- should Mechanical Turk and so on have specific labour laws? This is the case in favour.
  3. We Are What We Choose -- Jeff Bezos's graduation speech to Princeton's Class of 2010. Well worth reading.
  4. The Velluvial Matrix (New Yorker) -- Atul Gawande's graduation speech to Stanford's School of Medicine. The truth is that the volume and complexity of the knowledge that we need to master has grown exponentially beyond our capacity as individuals. Worse, the fear is that the knowledge has grown beyond our capacity as a society. When we talk about the uncontrollable explosion in the costs of health care in America, for instance—about the reality that we in medicine are gradually bankrupting the country—we’re not talking about a problem rooted in economics. We’re talking about a problem rooted in scientific complexity. (via agpublic on Twitter)

May 11 2010

Crowdsourcing and the challenge of payment

An unusual Distributed Work Meetup was held last night in four different
cities simultaneously, arranged through many hours of hard work by Lukas
Biewald and his colleagues at distributed work provider CrowdFlower.

With all the sharing of experiences and the on-the-spot analyses
taking place, I didn't find an occasion to ask my most pressing
question, so I'll put it here and ask my readers for comments:

How can you set up crowdsourcing where most people work for free but
some are paid, and present it to participants in a way that makes it
seem fair?

This situation arises all the time, with paid participants such as
application developers and community managers, but there's a lot of
scary literature about "crowding out" and other dangers. One basic
challenge is choosing what work to reward monetarily. I can think of
several dividing lines, each with potential problems:

  • Pay for professional skills and ask for amateur contributions on a
    volunteer basis.

    The problem with that approach is that so-called amateurs are invading
    the turf of professionals all the time, and their deft ability to do
    so has been proven over and over at crowdsourcing sites such as InnoCentive for inventors and
    SpringLeap or 99 Designs for designers. Still,
    most people can understand the need to pay credentialed professionals
    such as lawyers and accountants.

  • Pay for extraordinary skill and accept more modest contributions on a
    volunteer basis.

    This principle usually reduces to the previous one, because there's no
    bright line dividing the extraordinary from the ordinary. Companies
    adopting this strategy could be embarrassed when a volunteer turns in
    work whose quality matches the professional hires, and MySQL AB in
    particular was known for hiring such volunteers. But if it turns out
    that a large number of volunteers have professional skills, the whole
    principle comes into doubt.

  • Pay for tasks that aren't fun.

    The problem is that it's amazing what some people consider fun. On the
    other hand, at any particular moment when you need some input, you
    might be unable to find people who find it fun enough to do it for
    you. This principle still holds some water; for instance, I heard
    Linus Torvalds say that proprietary software was a reasonable solution
    for programming tasks that nobody would want to do for personal
    satisfaction.

  • Pay for critical tasks that need attention on an ongoing basis.

    This can justify paying people to monitor sites for spam and
    obscenity, keep computer servers from going down, etc. The problem
    with this is that no human being can be on call constantly. If you're
    going to divide a task among multiple people, you'll find that a
    healthy community tends to be more vigilant and responsive than
    designated individuals.

I think there are guidelines for mixing pay with volunteer work, and
I'd like to hear (without payment) ideas from the crowd.

Now I'll talk a bit about the meetup.

Venue and setup

I just have to start with the Boston-area venue. I had come to many
events at the MIT Media Lab and had always entered Building E14 on the
southwest side. The Lab struck me as a musty, slightly undermaintained
space littered with odd jetsam and parts of unfinished projects; a place you
could hardly find your way around but that almost dragged creativity
from you into the open. The Lab took up a new building in 2009 but to
my memory the impact is still similar--it's inherent to the mission
and style of the researchers.

For the first time last night, I came to the building's northeast
entrance, maintained by the MIT School of Architecture. It is Ariel to
the Media Lab's Caliban: an airy set of spacious white-walled forums
sparsely occupied by chairs and occasional art displays. In a very
different way, this space also promotes open thoughts and broad
perspectives.

The ambitious agenda called for the four host cities (Boston, New
York, San Francisco, and Seattle) to share speakers over
videoconferencing equipment. Despite extensive preparation, we all had
audio, video, and connectivity problems at the last minute (in fact,
the Boston organizers crowdsourced the appeal for a laptop and I
surrendered mine for the video feed). Finally in Boston we
disconnected and had an old-fashioned presentation/discussion with an
expert speaker.

In regard to the MIT Media Lab and Architecture School, I think it's
amusing to report that Foursquare didn't recognize either one when I
asked for my current location. Instead, Foursquare offered a variety
of sites across the river, plus the nearby subway, the bike path, and
a few other oddities.

We were lucky to have Jeff Howe, the
WIRED contributor who invented the term "crowdsourcing" and wrote a
popular book
on it. He is currently a Nieman Fellow at Harvard. His talk was wildly
informal (he took an urgent call from a baby sitter in the middle) but
full of interesting observations and audience interactions.

He asked us to promote his current big project with WIRED,
One Book, One Twitter. His goal is to reproduce globally the
literacy projects carried out in many cities (one happens every year
in my town, Arlington, Mass.) where a push to get everyone to read a
book is accompanied by community activities and meetups. Through a
popular vote on WIRED, the book American Gods by
Neil Gaiman was chosen, and
people are tweeting away at #1b1t and related tags.


With the sponsorship by CrowdFlower, our evening focused on
crowdsourcing for money. We had a few interesting observations about
the differences between free Wikipedia-style participation and
work-for-pay, but what was most interesting is that basic human
processes like community-building go on in both places.

Among Howe's personal revelations was his encounter with the fear of
crowdsourcing. Everyone panics when they first see what crowdsourcing
is doing to his or her personal profession. For instance, when Howe
talked about the graphic design sites mentioned earlier, professional
designers descended on him in a frenzy. He played the sage, lecturing
them that the current system for outsourcing design excludes lots of
creative young talent, etc.

But even Howe, when approached by an outfit that is trying to
outsource professional writing, felt the sting of competition and
refused to help them. But he offered respect for Helium, which encourages self-chosen
authors to sign up and compete for freelance assignments.

Howe is covering citizen journalism, though, a subject that Dan
Gillmor wrote about in a book that O'Reilly published, We the
Media, and that he continues to pursue at his Mediactive site and a new book.

Job protection can also play a role in opposition to crowdsourcing,
because it makes it easier for people around the world to work on
local projects. (Over half the workers on Mechanical Turk now
live in India. Biewald said one can't trust what workers say on their
profiles; IP addresses reveal the truth.) But this doesn't seem to
have attracted the attention of the xenophobes who oppose any support
for job creation in other countries, perhaps because it's hard to get
riled up about "jobs" that have the granularity of a couple seconds.

Crowdsourcing is known to occur, as Howe put it, in "situations of
high social capital," simply meaning that people care about each other
and want to earn each other's favor. It's often reinforced by explicit
rating systems, but even more powerful is the sense of sharing and
knowing that someone else is working alongside you. In a blog
post I wrote a couple years ago, I noted that competition site
TopCoder maintained a thriving
community among programmers who ultimately were competing with each other.

Similarly, the successful call center LiveOps provides forums for operators
to talk about their experiences and share tips. This has become not
just a source of crowdsourced help, and not even a way to boost morale
by building community, but an impetus for quality. Operators actually
discipline each other and urge each other to greater heights of
productivity. LiveOps pays its workers more per hour than outsourcing
calls to India normally costs clients, yet LiveOps is successful
because of its reputation for high quality.

We asked why communities of paid workers tended to reinforce quality
rather than go in the other direction and band together to cheat the
system. I think the answer is obvious: workers know that if they are
successful at cheating, clients will stop using the system and it will
go away, along with their source of income.

Biewald also explained that CrowdFlower has found it fairly easy to
catch and chase away cheaters. It seeds its jobs with simple questions
to which it knows the right answers, and warns the worker right away
if the questions are answered incorrectly. After a couple alerts, the
bad worker usually drops out.
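The gold-question scheme Biewald described can be sketched in a few lines. This is a hypothetical illustration of the idea, not CrowdFlower's implementation; the question IDs, answers, and two-strike threshold are invented:

```python
def grade_worker(answers, gold, max_misses=2):
    """Check a worker's answers against seeded known-answer questions.
    Returns (number of gold misses, whether the worker is flagged).
    Sketch of the gold-seeding idea, not CrowdFlower's actual system."""
    misses = [q for q, a in answers.items() if q in gold and gold[q] != a]
    return len(misses), len(misses) >= max_misses

gold = {"q1": "cat", "q7": "blue"}               # seeded questions with known answers
worker = {"q1": "cat", "q3": "42", "q7": "red"}  # q3 is a real (ungraded) task
warnings, flagged = grade_worker(worker, gold)
print(warnings, flagged)  # one miss on q7: warned, but not yet flagged
```

Because the gold questions are indistinguishable from real tasks, a worker answering at random fails them quickly, which is why a couple of alerts usually suffice to drive cheaters away.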

We had a brief discussion afterward about the potential dark side of
crowdsourcing, which law professor Jonathan Zittrain covered in a talk
called Minds
for Sale. One of Zittrain's complaints is that malicious actors
can disguise their evil goals behind seemingly innocuous tasks farmed
out to hundreds of unknowing volunteers. But someone who used to work
for Amazon's Mechanical Turk said people are both smarter and more
ethical than they get credit for, and that participants on that
service quickly noted any task that looked unsavory and warned each
other away.

As the name Mechanical Turk (which of course had a completely
unrelated historical origin) suggests, many tasks parceled out by
crowdsourcing firms are fairly mechanical ones that we just haven't
figured out how to fully mechanize yet: transcribing spoken words,
recognizing photos, etc. Biewald said that his firm still has a big
job persuading potential clients that they can trust key parts of the
company supply chain to anonymous, self-chosen workers. I think it may
be easier when the company realizes that a task is truly mechanical
and that they keep full control over the design of the project. But
crowdsourcing is moving up in the world fast; not only production but
control and choice are moving into the crowd.

Howe highlighted Fox News, which runs a UReport site for
volunteers. The stories on Fox News' web site, according to Howe, are
not only written by volunteers but chosen through volunteer ratings,
somewhat in Slashdot style.

Musing on the sociological and economic implications of crowdsourcing,
as we did last night, can be exciting. Even though Mechanical Turk
doesn't seem to be profitable, its clients capture many cost savings,
and other crowdsourcing firms have made headway in particular
fields. Howe hails crowdsourcing as the first form of production that
really reflects the strengths of the Internet, instead of trying to
"force old industrial-age crap" into an online format. But beyond the
philosophical rhetoric, crowdsourcing is an area where a lot of
companies are making money.

May 04 2010

Four short links: 4 May 2010

  1. Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks (PNAS) -- paper comparing structure and evolution of software design (exemplified by the Linux operating system) against biological systems (in the form of the E. coli bacterium). They found software has a lot more "middle manager" functions (functions that are called and then in turn call) as opposed to biology, where "workers" predominate (genes that make something, but which don't trigger other genes). They also quantified how software and biology value different things (as measured by what persists across generations of organisms, or versions of software): Reuse and persistence are negatively correlated in the E. coli regulatory network but positively correlated in the Linux call graph[...]. In other words, specialized nodes are more likely to be preserved in the regulatory network, but generic or reusable functions are persistent in the Linux call graph. (via Hacker News)
  2. Virtual Keyboards in Google Search -- rolling out virtual keyboards across all Google searches. Very nice solution to the problem of "how the heck do I enter that character on this keyboard?". (via glynmoody on Twitter)
  3. Information and Quantum Systems Lab at HP -- working on the mathematical and physical foundations for the technologies that will form a new information ecosystem, the Central Nervous System for the Earth (CeNSE), consisting of a trillion nanoscale sensors and actuators embedded in the environment and connected via an array of networks with computing systems, software and services to exchange their information among analysis engines, storage systems and end users. (via dcarli on Twitter)
  4. Turkit -- Java/JavaScript API for running iterative tasks on Mechanical Turk. (via chrismessina on Twitter)
