
April 17 2012

Microsoft opens up

While Microsoft's previous stance on open source systems is well known, it turns out there's been a serious shift as Microsoft looks to bring more non-.NET programmers into the fold.

On April 12, Jean Paoli, president of a new subsidiary of Microsoft called Microsoft Open Technologies, Inc., wrote about the new initiative. In his words, the subsidiary was created "to advance the company's investment in openness — including interoperability, open standards and open source." This is a public step toward working with open source communities and integrating technologies into Microsoft's closed initiatives, which may not be quite so closed in the future. With that in mind, below I take a look at what's new with Microsoft and open source.

While these projects provide proof that the pendulum is swinging in the open source direction, the impact for Microsoft can and will reach much further. New markets, programmers, and communities are at play here if this new tack goes well.

Attracting the polyglot programmers

This shift in ideology will likely help Microsoft on a number of fronts, including finding new programmers and communities. For example, Microsoft may lure developers to Windows 8 — rumored to be launching in October — by making it as easy as possible to get up and running. HTML5/JavaScript as well as C++ can be used to create Windows 8 Metro applications, and Microsoft hasn't forgotten its own .NET developers, who will use C#. The common theme you will see with the Windows 8 release and others is that Microsoft is trying to become less isolated from the rest of the programming community, many of whom are now polyglot programmers.

Hadoop's halo effect

Azure, Microsoft's cloud platform, is slowly gaining momentum as enterprises make the shift to cloud services. The key word here is "slowly." On the other hand, Hadoop, an open source Apache project that's become a central part of the big data movement, has a huge and active community that's improving the code minute by minute. Supporting Hadoop on Azure lets Microsoft incorporate the popularity and visibility of an open source project into a Microsoft initiative that needs more exposure.

A marketing signal

With a Microsoft Openness website that speaks to the relationship it has with open source technologies, and an accompanying Twitter account (@OpenatMicrosoft) with more than 6,500 followers, the Microsoft marketing team also seems to think open source exposure is important. (Side note: Gianugo Rabellino, Microsoft senior director of open source communities, and one of the people tweeting from the @OpenatMicrosoft account, will be presenting at the OSCON conference this summer.)

As Microsoft continues to see viable open source projects gain momentum, you can be sure that it will work on including ways for those languages, libraries, and frameworks to be incorporated into its current and future platforms. But the more meaningful change is that Microsoft is seeing that opening its own technologies to programmers will only make its products better, more accessible, and central to the future of programming.

Fluent Conference: JavaScript & Beyond — Explore the changing worlds of JavaScript & HTML5 at the O'Reilly Fluent Conference (May 29 - 31 in San Francisco, Calif.).

Save 20% on registration with the code RADAR20

Photo: Open Sign by dlofink, on Flickr


December 16 2010

Strata Week: Shop 'til you drop

Need a break from the holiday madness? You're not alone. Check out these items of interest from the land of data and see why even the big consumers face tough choices.

Does this place accept returns?

On Monday, Stack Overflow announced that they have moved the Stack Exchange Data Explorer (SEDE) off of the Windows Azure platform and onto in-house hardware.


SEDE is an open source, web-based tool for querying the monthly data dump of Creative Commons data from Stack Exchange's four main Q&A sites (Stack Overflow, Server Fault, Super User, and Meta) as well as other sites in the Stack Exchange family. The primary reason given for the move (in a polite write-up by Jeff Atwood and SEDE lead Sam Saffron) was the desire to have fine-tuned control over the platform.

When you are using a [Platform-as-a-Service] you are giving up a lot of control to the service provider. The service provider chooses which applications you can run and imposes a series of restrictions. ... It was disorienting moving to a platform where we had no idea what kind of hardware was running our app. Giving up control of basic tools and processes we use to tune our environment was extremely painful.

While the support that comes with Platform-as-a-Service was acknowledged, it seems that the ability to better automate, adjust, and perpetuate processes and systems with more fine-grained control won out as a bigger convenience.



Where did you get that lovely platform?


Of course, one company's headache is another's dream. Netflix, a company known for playing with big data and crowdsourcing solutions "before it was cool," posted on Tuesday the four reasons they've chosen to use Amazon Web Services (AWS) as their platform and have moved onto it over the last year.

Laudably, the company states that it viewed its tremendous recent growth (in terms of both members and streaming devices) as a license to question everything in the necessary process of re-architecting. Instead of building out their own data centers, etc., they decided to answer that set of questions by paying someone else to worry about it.

Also to their credit, Netflix has enough self-awareness to know what they are and aren't good at. Building top-notch recommendation systems and providing entertainment? You betcha. Predicting customer growth and device engagement? Not so much.

How many subscribers would you guess used our Wii application the week it launched? How many would you guess will use it next month? We have to ask ourselves these questions for each device we launch because our software systems need to scale to the size of the business, every time.

Self-awareness is in fact the primary lesson in both Netflix's and Stack Exchange's platform decisions. If you feel your attention is better spent elsewhere, write a check. If you've got the time and expertise to hone your hardware, roll your own.

[Of course, Netflix doesn't go for the pre-packaged solutions every time. They also posted recently about why they love open source software, and listed among the projects they make use of and contribute back to: Hadoop, Hive, HBase, Honu, Ant, Tomcat, Hudson, Ivy, Cassandra, etc.]

With what shall we shop?

The New York Times this week released a cool group of interactive maps based on data collected in the Census Bureau's American Community Survey (ACS) from 2005 to 2009. Data is compared against the 2000 census to uncover rates of change.

[While similar to the census, the ACS is conducted every year instead of every 10 years. The ACS includes only a sampling of addresses instead of a comprehensive inventory. It covers much of the same ground on population (age, race, disability status, family relationships), but it also asks for information that is used to help make funding distribution decisions about community services and institutions.]

The Times maps explore education levels; rent, mortgage rates, and home values; household income; and racial distribution. Viewers can select among 22 maps in these four categories, and then pan and zoom to view national, state, or local trends down to the level of individual census tracts.

Above is the national view of the map that looks at change in median household income. The ACS website itself provides some maps displaying the survey numbers from the 2000 census and the 2005-2009 survey, as well as a listing of data tables.

The Times map shows the uneven way in which these numbers have gone up or down in various parts of the country, with some surprising results that are worth exploring. Note that the blue regions are places where income has dropped, and the yellow regions are places where it has increased. (No wonder a lot of us are getting creative with holiday shopping.)

If this kind of research floats your boat, check out Social Explorer, the mapping tool used to create the New York Times maps.

Even markets like to buy things

The emerging landscape of custom data markets is already shifting as Infochimps recently announced the acquisition of Data Marketplace, a start-up incubated at Y Combinator.

While Stewart Brand may be right in thinking information wants to be free, there's also enormous value to be added by aggregating, structuring, and packaging data, as well as in matching up buyers with sellers. That's the main service Data Marketplace aims to provide, particularly in the field of financial data.

At Infochimps, information is offered a la carte, and many of the site's datasets are offered for free. These include sets as diverse as "Word List - 100,000+ official crossword words (Excel readable)", "Measuring Worth: Interest Rates - US & UK 1790-2000", and "Retrosheet: Game Logs (play-by-play) for Major League Baseball Games." Data Marketplace is a bit different, in that it allows users to enter requests for data (with a deadline and budget, if desired) and then matches up would-be buyers with data providers.

Infochimps has said that Data Marketplace, which is less than a year old, will continue to operate as a standalone site, although its founders Steve DeWald and Matt Hodan will depart for new projects.

If you're interested in the burgeoning business of aggregated datasets, be sure to check out the Data Marketplaces panel I'll be moderating at Strata in February.

Not yet signed up for Strata? Register now and save 30% with the code STR11RAD.

October 07 2010

Developing intuitions about data

In The laws of information chemistry I mentioned that my local high school uses a PDF file to publish the school's calendar of events. Let's look at some different ways to represent the calendar entries for Oct 6, 2010. First I'll divide these representations into two major categories: "What People See," and "What Computers See." Then I'll discuss how the various formats serve various purposes.

Category 1: What People See

Here's a piece of the PDF file for the week of Oct 4, 2010.


Fig. 1a: How the PDF looks to a person

And here's how the same entries might look in Google Calendar (or in any other calendar program).


Fig. 1b: How the calendar looks to a person

Category 2: What Computers See

The PDF file describes fonts and layout in a highly structured way. But the calendar's data -- dates, times, descriptions -- lives only in free-form text. Computers use the file to enable people to read or print that text.



10/6



-Junior class NECAP testing info. Meeting block 4 (aud.)

-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm (Davenson/Sintros)

-Field trip: Physics to Arnone’s 7:40-11 a.m. (Lybarger/Romano) List will be sent.

-Senior workshop: “Tips & tricks for writing your college essay 8:05-8:45 & 1:200-2:02 (GCR)

-New teacher workshop 2:15-3:00 p.m. (PCR) “Guidance & Special Ed. Responsibilities”


Fig. 2a: How the data in the PDF file looks to a computer


When your browser renders the calendar, it sees a mixture of HTML and JavaScript. Computers use that mixture to enable people to read, print, and also interact with the text.

<TR class="lv-row lv-newdate lv-firstevent lv-alt">

<TH class=lv-datecell rowSpan=5><A class=lv-datelink 
href="javascript:void(Vaa('20101006'))">Wed Oct 6</A></TH>

<TD class="lv-eventcell lv-status"> </TD>

<TD class="lv-eventcell lv-time"><SPAN class=lv-event-time
onmousedown="Waa(event,'listview','YzFmYT...b2tAZw','20101006');return false;">All
day</SPAN></TD>

<TD class="lv-eventcell lv-titlecell">
<DIV id=listviewzYzFmYT...b2tAZw20101006 class=lv-zippy
onmousedown="Waa(event,'listview','YzFmYT...b2tAZw','20101006');return false;"></DIV>

<DIV class=lv-event-title-line><A style="COLOR: #1f753c" class=lv-event-title
onmousedown="Waa(event,'listview','YzFmYT...b2tAZw','20101006');return false;"
href="javascript:void(0)">-Junior class NECAP testing info. Meeting block 4
<SPAN dir=ltr>(aud.)</SPAN></A> </DIV>

Fig. 2b: How the HTML looks to a computer



A calendar application or service that knows how to use a standard format called iCalendar will receive a structured representation of the data. It relies on that structure to identify, recombine, and exchange the dates, times, and descriptions.

BEGIN:VCALENDAR
PRODID:-//Google Inc//Google Calendar 70.9054//EN
VERSION:2.0
BEGIN:VEVENT
DTSTART:20101006T113000Z
DTEND:20101006T190000Z
DTSTAMP:20101005T172506Z
UID:bccvmn5aooodokincjbgl8crc0@google.com
CREATED:20101005T161914Z
DESCRIPTION:
LOCATION:
SUMMARY:-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm (Davenso
 n/Sintros)
END:VEVENT
END:VCALENDAR



Fig. 2c: How the iCalendar feed looks


If a proposed format called xCalendar is approved as a standard, and is widely adopted by calendar applications and services, then calendar applications or services might also use that format to identify, recombine, and exchange dates, times, and descriptions.



<icalendar xmlns="urn:ietf:params:xml:ns:icalendar-2.0">
<vcalendar>
<properties>
<prodid>
<text>-//Google Inc//Google Calendar 70.9054//EN</text>
</prodid>
<version>
<text>2.0</text>
</version>
</properties>
<components>
<vevent>
<properties>
<dtstamp>20101005T172506Z</dtstamp>
<dtstart>20101006T113000Z</dtstart>
<dtend>20101006T190000Z</dtend>
<uid>
<text>bccvmn5aooodokincjbgl8crc0@google.com</text>
</uid>
<summary>
<text>-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm
(Davenson/Sintros)</text>
</summary>
</properties>
</vevent>
</components>
</vcalendar>
</icalendar>

Fig. 2d: How an xCalendar feed might look



Note that Fig. 2c (iCalendar) and Fig. 2d (xCalendar) look very different. The iCalendar format uses lines of plain text to represent name:value pairs. The xCalendar format uses a package of nested XML elements to represent the same data. Technical experts can, and do, endlessly debate the pros and cons of these different approaches. But for our purposes here, the key observations are:

  • Fig. 2c and Fig. 2d contain the same data


  • Computers can reliably extract that data

  • Computers can transform either format into the other without loss of fidelity

  • Computers can also transform either format into one that's more directly useful to people -- e.g., HTML or PDF
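As a rough illustration of the second and third observations, here is a minimal Python sketch that unfolds the iCalendar lines from Fig. 2c, reads them as name:value pairs, and re-emits the same values as nested XML in the style of Fig. 2d. The function names (unfold, parse_properties, to_xml) are invented for this example. The parser assumes a single event, ignores property parameters, and the XML layout only mimics the figure rather than following the xCalendar proposal exactly.

```python
import xml.etree.ElementTree as ET

# One event from Fig. 2c. The SUMMARY value is folded: a line that starts
# with a space continues the previous line.
ICAL = """BEGIN:VEVENT
DTSTART:20101006T113000Z
DTEND:20101006T190000Z
UID:bccvmn5aooodokincjbgl8crc0@google.com
SUMMARY:-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm (Davenso
 n/Sintros)
END:VEVENT"""


def unfold(text):
    """Join folded lines back onto the line they continue."""
    lines = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def parse_properties(text):
    """Return the name:value pairs of one event as a dict."""
    props = {}
    for line in unfold(text):
        # Split at the first colon only; values may contain colons (7:30).
        name, _, value = line.partition(":")
        if name not in ("BEGIN", "END"):
            props[name] = value
    return props


def to_xml(props):
    """Wrap each property in a lowercase element, loosely following Fig. 2d."""
    vevent = ET.Element("vevent")
    container = ET.SubElement(vevent, "properties")
    for name, value in props.items():
        ET.SubElement(container, name.lower()).text = value
    return ET.tostring(vevent, encoding="unicode")


event = parse_properties(ICAL)
print(event["SUMMARY"])
print(to_xml(event))
```

Nothing here has to guess where a date ends or a title begins. The names in the feed say so, which is why the round trip is mechanical.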

It's also worth noting that this simple name:value technique, which has underpinned the Internet calendar standard for over a decade, is broadly useful. Curators of elmcity calendar hubs, for example, follow a convention for representing name:value pairs as tags, attached to Delicious bookmarks, that have the form name=value. A similar convention enables any calendar event, made by any calendar program, to specify the URL for the event and the categories that it belongs to. In this week's companion article on answers.oreilly.com I show how to extract these name:value pairs from free text.
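To give a feel for how little machinery the name=value convention needs, here is a small Python sketch that picks such pairs out of a space-separated tag string. The tag names and the URL are made up for illustration, and tags_to_pairs is a hypothetical helper; this is not the actual elmcity code or its real set of tags.

```python
def tags_to_pairs(tags):
    """Keep only the tags shaped like name=value and return them as a dict."""
    pairs = {}
    for tag in tags.split():
        name, sep, value = tag.partition("=")
        if sep and name and value:
            pairs[name] = value
    return pairs


# Hypothetical tags attached to a bookmark; ordinary tags are skipped.
example = "calendar feed=http://example.org/events.ics category=music tz=eastern"
settings = tags_to_pairs(example)
print(settings)
```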

A taxonomy of representations and purposes

Let's chart these representations and arrange them according to purpose.




What people see | Why?                         | What computers see | Why?
Fig. 1a: pdf    | To view and print            | Fig. 2a: pdf       | To enable people to view and print
Fig. 1b: html   | To view, print, and interact | Fig. 2b: html      | To enable people to view, print, and interact
                |                              | Fig. 2c: iCalendar | To enable data to flow reliably and recombine easily
                |                              | Fig. 2d: xCalendar | To enable data to flow reliably and recombine easily


To most people, all four items in the What Computers See column are roughly equivalent. They're understood to be computer files of one sort or another. But when computers use these files on our behalf, they use them in very different ways. The first two uses enable people to read, print, and interact online. The latter two enable computers to exchange data without loss of fidelity, so that other people can read, print, and interact online.

The laws of information chemistry say that if we want to exchange data, we must provide it in a format that's useful for that purpose. In this example the PDF and HTML formats aren't; the iCalendar and xCalendar formats are. To most people it's not obvious why that's so. Our brains are such powerful pattern recognizers, and we know so much about the world in which the patterns occur, that we can look at Fig. 2a and see that the text clearly implies a structure involving dates, times, titles, and descriptions. Computers can't do that so easily or so well.
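To make that concrete, here is a naive Python sketch that tries to pull a time range out of three of the free-form lines in Fig. 2a. The regular expression and the find_time_range helper are my own illustration, not part of any calendar program, and a real parser would need many more rules than this.

```python
import re

# Naive pattern: "H:MM [am|pm] - H[:MM] [am|pm]"
TIME_RANGE = re.compile(
    r"(\d{1,2}:\d{2})\s*(am|pm)?\s*-\s*(\d{1,2}(?::\d{2})?)\s*(am|pm)?",
    re.IGNORECASE,
)


def find_time_range(line):
    """Return (start, start_meridiem, end, end_meridiem) or None if no match."""
    match = TIME_RANGE.search(line)
    return match.groups() if match else None


lines = [
    "-Junior class NECAP testing info. Meeting block 4 (aud.)",
    "-Rain date for AP Env. Sci. trip to Monadnock 7:30 am-3 pm (Davenson/Sintros)",
    "-Field trip: Physics to Arnone's 7:40-11 a.m. (Lybarger/Romano) List will be sent.",
]
for line in lines:
    print(find_time_range(line))
```

The first line has no clock time at all ("block 4" means something to the school, not to the pattern). The second parses cleanly. The third yields the hours but loses the morning/afternoon marker, because it is written "a.m." rather than "am". A person reads all three without effort; each new spelling costs the computer another rule.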

Computers are, of course, getting smarter all the time. Google Calendar's Quick Add feature is a perfect example. I used it to create the example shown in Fig. 1b, and it did a great job of parsing out the times and titles of the events. But that was only possible because I inserted the events, one at a time, into a container that Google Calendar understood to represent Wed Oct 6. It wouldn't be able to import the free-form text that was the original source for the PDF file. No other calendar program could either.

The surprising difficulty of structured information

It's counter-intuitive that computers don't recognize structure easily or reliably. But so are many other things. For example:

You have $100. It grows by 25%, then shrinks by 25%. Do you end up with more or less?

You can live a long time without ever developing an intuition that the final amount is less. And you may be profoundly harmed because you lack that intuition. If you have it, you most likely didn't acquire it all by yourself. Either somebody taught it to you, or nobody did.
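For anyone who wants to check the arithmetic, here it is worked out in a few lines of Python. The variable names are just for this example.

```python
start = 100.0
after_gain = start * 1.25       # grows by 25% of 100, giving 125
after_loss = after_gain * 0.75  # shrinks by 25% of the new, larger amount
print(after_loss)               # less than the starting 100

# The same thing in general: a gain of r followed by a loss of r
# multiplies the amount by (1 + r) * (1 - r) = 1 - r**2, which is below 1.
r = 0.25
print((1 + r) * (1 - r))
```

The loss is taken from 125, not from 100, so it removes more than the gain added.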

Although our sample PDF file contains no structured representation of the events that it exists to convey, it does contain some other structured data:






Title: Microsoft Word - weekly draft
Made_by: Word
Created_with: Mac OS X 10.4.11 Quartz PDFContext

From this we learn that the calendar originates in Microsoft Word. Why Word instead of a calendar program? Available cloud-based applications include Google Calendar and Hotmail Calendar. On the Mac desktop where the document originated, there's Apple iCal. If one of these alternatives were even considered, a number of valid concerns would arise:

  1. It's cumbersome to enter data into a calendar program's input fields; it's much easier and quicker to type into a Word table

  2. The document doesn't only contain structured data; it is also a textual narrative. Calendar programs don't flexibly accommodate narrative.

  3. The webmaster knows how to post a PDF, but wouldn't know what to do with dual outputs from a calendar program (one for humans to read, another for computers to process).

And if alternatives were considered, we could discuss those concerns:

  1. Yes, it is more cumbersome to enter data into a calendar program. But do we want students and teachers and parents to be able to pull these events into their own calendars? Do we want the events to also be able to flow automatically to community-wide calendars? If so, these are big payoffs for a fairly small investment of extra effort. And by doing things this way, we'll demonstrate the 21st-century skills that we say our students need to learn and apply.
  2. Yes, it's true that calendar programs don't accommodate narrative. But we're publishing to the web. We can use documents and links to build a context that includes: the calendar in an HTML format that people can read, print, and interact with; the calendar in another format that can syndicate to other calendars; narrative related to the calendar.

  3. Yes, but the webmaster needn't even be tasked with this chore. Various tools -- some that we already have and use, others that are freely available -- enable us to publish the desired formats ourselves.

Since alternatives are almost never considered, though, the ensuing discussion almost never happens. Why not? Key intuitions are missing. Some kinds of computer files have different properties than others, and thus serve different purposes. Structured representation of data is one such property. If we are trying to put data onto the web, and if we want others to have the use of that data, and if we hope it will flow reliably through networks to all the places where it's needed, then we ought to consider how the files we choose to publish do, or don't, respect that property.

Nobody is born knowing this stuff. We need to learn it. Schools aren't the only source of instruction. But they ought to teach core principles that govern the emerging web of people, data, and services. And they ought to cultivate intuitions about when, why, and how to apply those principles.




