
August 29 2013

Four short links: 30 August 2013

  1. intention.js -- manipulates the DOM via HTML attributes. The methods for manipulation are placed with the elements themselves, so flexible layouts don’t seem so abstract and messy.
  2. Introducing Brick: Minimal-markup Web Components for Faster App Development (Mozilla) — a cross-browser library that provides new custom HTML tags to abstract away common user interface patterns into easy-to-use, flexible, and semantic Web Components. Built on Mozilla’s x-tags library, Brick allows you to plug simple HTML tags into your markup to implement widgets like sliders or datepickers, speeding up development by saving you from having to initially think about the under-the-hood HTML/CSS/JavaScript.
  3. F1: A Distributed SQL Database That Scales -- a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency. Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design. F1 also includes a fully functional distributed SQL query engine and automatic change tracking and publishing.
  4. Looking Inside The (Drop)Box (PDF) — This paper presents new and generic techniques, to reverse engineer frozen Python applications, which are not limited to just the Dropbox world. We describe a method to bypass Dropbox’s two factor authentication and hijack Dropbox accounts. Additionally, generic techniques to intercept SSL data using code injection techniques and monkey patching are presented. (via Tech Republic)

April 24 2013

Four short links: 30 April 2013

  1. China = 41% of World’s Internet Attack Traffic (Bloomberg) — numbers are from Akamai’s research. Verizon Communications said in a separate report that China accounted for 96 percent of all global espionage cases it investigated. One interpretation is that China is a rogue Internet state, but another is that we need to harden up our systems. (via ZD Net)
  2. Open Source Cannot Live on Donations Alone — excellent summary of some of the sustainability questions facing open source projects.
  3. China Startups: The Gold Rush (Steve Blank) — dense fact- and insight-filled piece. Not only is the Chinese ecosystem completely different but also the consumer demographics and user expectations are equally unique. 70% of Chinese Internet users are under 30. Instead of email, they’ve grown up with QQ instant messages. They’re used to using the web and increasingly the mobile web for everything, commerce, communication, games, etc. (They also probably haven’t seen a phone that isn’t mobile.) By the end of 2012, there were 85 million iOS and 160 million Android devices in China. And they were increasing at an aggregate 33 million IOS and Android activations per month.
  4. Calculating Rolling Cohort Retention with SQL — just what it says. (via Max Lynch)
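The rolling-cohort technique in link 4 can be sketched in a few lines of SQL. The following is a minimal, hypothetical version — the `events(user_id, day)` schema and the sample rows are invented for illustration — run through Python's built-in sqlite3:

```python
import sqlite3

# Minimal sketch of cohort retention: a hypothetical events(user_id, day)
# table; a user's cohort is the first day they were seen, and retention is
# how many of that cohort are still active at each later "age".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, day INTEGER);
INSERT INTO events VALUES (1,0),(2,0),(3,0),(1,1),(2,1),(1,2);
""")

rows = conn.execute("""
WITH cohorts AS (
  SELECT user_id, MIN(day) AS cohort_day FROM events GROUP BY user_id
)
SELECT c.cohort_day,
       e.day - c.cohort_day      AS age,
       COUNT(DISTINCT e.user_id) AS active
FROM events e JOIN cohorts c ON e.user_id = c.user_id
GROUP BY c.cohort_day, age
ORDER BY c.cohort_day, age
""").fetchall()

print(rows)  # [(0, 0, 3), (0, 1, 2), (0, 2, 1)]
```

With the toy data, the day-0 cohort starts with 3 users, retains 2 on day 1, and 1 on day 2 — the row shape a retention chart is drawn from.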

October 21 2011

Developer Week in Review: Talking to your phone

I've spent the last week or so getting up to speed on the ins and outs of Vex Robotics tournaments since I foolishly volunteered to be competition coordinator for an event this Saturday. I've also been helping out my son's team, offering design advice where I could. Vex is similar to Dean Kamen's FIRST Robotics program, but the robots are much less expensive to build. That means many more people can field robots from a given school and more people can be hands-on in the build. If you happen to be in southern New Hampshire this Saturday, drop by Pinkerton Academy and watch two dozen robots duke it out.

In non-robotic news ...

Why Siri matters

It's easy to dismiss Siri, Apple's new voice-driven "assistant" for the iPhone 4S, as just another refinement of the chatbot model that's been entertaining people since the days of ELIZA. No one would claim that Siri could pass the Turing test, for example. But, at least in my opinion, Siri is important for several reasons.

On a pragmatic level, Siri makes a lot of common smartphone tasks much easier. For example, I rarely used reminders on the iPhone and preferred to use a real keyboard when I had to create appointments. But Siri makes adding a reminder or appointment so easy that I have made it pretty much my exclusive method of entering them. It's also going to be a big win for drivers trying to use smartphones in their cars, especially in states that require hands-free operation.

I suspect Siri will also end up being a classic example of crowdsourcing. If I were Apple, I would be capturing every "miss" that Siri couldn't handle and looking for common threads. Since Siri is essentially doing natural language processing and applying rules to your requests, Apple can improve Siri progressively by adding the low-hanging fruit. For example, at the moment, Siri balks at a question like, "How are the Patriots doing?" I'd be shocked if it fails to answer that question in a year since sports scores and standings will be at the heart of commonly asked questions.
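The capture-the-misses loop described above can be sketched as a toy rule-based intent matcher. Everything here — the rules, the intent names, the `misses` log — is invented for illustration; Apple has published nothing about Siri's internals:

```python
import re

# Toy sketch of rule-based intent matching: map utterance patterns to
# intents, and log every miss so common misses can become new rules.
RULES = [
    (re.compile(r"remind me to (?P<task>.+)", re.I), "create_reminder"),
    (re.compile(r"how are the (?P<team>\w+) doing", re.I), "sports_score"),
]
misses = []  # the "capture every miss" feedback loop

def match_intent(utterance):
    for pattern, intent in RULES:
        m = pattern.search(utterance)
        if m:
            return intent, m.groupdict()
    misses.append(utterance)  # candidate for a future rule
    return "unknown", {}

print(match_intent("Remind me to call Mom"))       # ('create_reminder', {'task': 'call Mom'})
print(match_intent("How are the Patriots doing?")) # ('sports_score', {'team': 'Patriots'})
```

The point of the sketch is the `misses` list: mine it for frequent patterns, add a rule, and the assistant appears to "learn" without any deep NLP changes.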

For developers, the benefits of Siri are obvious. While it's a closed box right now, if Apple follows its standard model, we should expect to see API and SDK support for it in future releases of iOS. At the moment, apps that want voice control (and they are few and far between) have to implement it themselves. Once apps can register with Siri, any app will be able to use voice.

Velocity Europe, being held Nov. 8-9 in Berlin, will bring together the web operations and performance communities for two days of critical training, best practices, and case studies.

Save 20% on registration with the code RADAR20

Can OpenOffice survive?

Long-time WIR readers will know that I'm no fan of how Oracle has treated its acquisitions from Sun. A prime example is OpenOffice. In June, OpenOffice was spun off from Oracle, and therefore lost its allowance. Now the OpenOffice team is passing around the hat, looking for funds to keep the project going.

We need to support OpenOffice because it's the only project that really keeps Microsoft honest as far as providing open-standards access to Microsoft Office products. It's also the only way that Linux users can deal with the near-ubiquitous use of Office document formats in the real world (short of running Office in a VM or with Wine).

The revenge of SQL

The NoSQL crowd has always had Google App Engine as an ally since the only database available to App Engine apps has been the App Engine Datastore, which (among other things) doesn't support joins. But much as Apple initially rejected multitasking on the iPhone (until it decided to embrace it), Google appears to have thrown in the towel as far as SQL goes.

It's always dangerous to hold an absolutist position (with obvious exceptions, such as despising Jar Jar Binks). SQL may have been overused in the past, but it's foolish to reject SQL altogether. It can be far too useful at times. SQL can be especially handy, as an example, when developing pure REST-like web services. It's nice to see that Google has taken a step back from the edge. Or, to put it more pragmatically, that it listens to its customer base on occasion.

Got news?

Please send tips and leads here.



September 16 2011

Four short links: 16 September 2011

  1. A Quick Buck by Copy and Paste -- scorching review of O'Reilly's Gamification by Design title. tl;dr: reviewer, he does not love. Tim responded on Google Plus. Also on the gamification wtfront, Mozilla Open Badges. It talks about establishing a part of online identity, but to me it feels a little like a Mozilla Open Gradients project would: cargocult-confusing the surface for the substance.
  2. Google+ API Launched -- the first piece of a Google+ API is released. It provides read-only programmatic access to people, posts, checkins, and shares. Activities are retrieved as triples of (subject, verb, object), which is semweb cute and ticks the social object box, but is unlikely in present form to reverse declining numbers of users.
  3. Cube -- open source time-series visualization software from Square, built on MongoDB, Node, and Redis. As Artur Bergman noted, the bigger news might be that Square is using MongoDB (known meh).
  4. Tenzing -- an SQL implementation on top of Map/Reduce. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility. Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data. In this paper, we describe the architecture and implementation of Tenzing, and present benchmarks of typical analytical queries. (via Raphaël Valyi)

July 29 2011

Four short links: 29 July 2011

  1. SQL Injection Pocket Reference (Google Docs) -- just what it sounds like. (via ModSecurity SQL Injection Challenge: Lessons Learned)
  2. isostick: The Optical Drive in a Stick (KickStarter) -- clever! A USB memory stick with drivers that emulate optical drives so you can boot off .iso files you've put on the memory stick. (via Extreme Tech)
  3. CrowdDB: Answering Queries with Crowdsourcing (Berkeley) -- CrowdDB uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. It uses SQL both as a language for posing complex queries and as a way to model data. (via Big Data)
  4. The DIY Electronic Medical Record (Bryce Roberts) -- I had a record of my daily weight, my exercising (catalogued by type), my walking, my calories burned and now, with the addition of Zeo, my nightly sleep patterns. All of this data had been passively collected with little to no manual input required from me. Total investment in this personal sensor network was in the range of a couple hundred dollars. And, as I rummaged through my data it began to hit me that what I’ve really been doing is creating my own DIY Electronic Medical Record. The Quantified Self is about more than obsessively cataloguing your bowel movements in low-contrast infographics. I'm less enthused by the opportunities to publicly perform private data, a-la the wifi body scale, than I am by opportunities to gain personal insight.

December 14 2010

Big data, but with a familiar face

Strata Conference 2011To prepare for O'Reilly's upcoming Strata Conference, we're continuing our series of conversations with some of the leading innovators working with Big Data and analytics. Today, we have a brief chat with Martin Hall, co-founder, president, and CEO of Karmasphere.

Karmasphere is one of several companies shipping commercial tools that make Big Data more accessible to developers and analysts. Hall says the company's products focus on making the data accessible by integrating with tools and languages familiar to developers — like SQL.

"We're focused on providing a new kind of software for working with Big Data stored in Hadoop clusters. In particular, tools for developers and analysts, and doing it in such a way that they get familiar tools and familiar environments and can quickly be very productive analyzing and transforming data stored in Hadoop clusters."

Integrating big data into business will be discussed at the Executive Summit at the upcoming Strata Conference (Feb. 1-3, 2011). Save 30% on registration with the code STR11RAD.

Karmasphere Studio is the company's main product for developers. It's a graphical interface for programming and debugging MapReduce jobs, and it integrates within IDEs like Eclipse and NetBeans. The company recently announced Karmasphere Analyst, which offers a familiar SQL interface for querying Hadoop clusters.

Hall says businesses typically dip their toe into Big Data with a small R&D cluster. "Once they see success with that, they deploy it into production. Once they have it in production, they're looking to connect it with other data sources."

Over the past 18 months, customers who've reached that threshold have been asking Karmasphere for more and better visualization tools, not only at the front end where decision-makers need insights, but for developers who need "more ability to see what's going on in the cluster, to see the progress of their jobs, to analyze and debug what's going on." Hall says they're working on more hook-ins with existing visualization packages.

"We don't expect people who are embracing Hadoop to have to sweep away everything they've invested in, in terms of skill sets, hardware, or software," Hall says. "It's an integration story."

You'll find the full interview in the following video:

November 17 2010

Developer Week in Review

Here's what's new for the trendy developer this week:

Java's future on Apple: Slightly less in doubt

Last week, it looked like Apple was all "You're not welcome here, Java." In the changeable world that is Jobsland, this week Apple was offering to marry the language, reiterating their support for Java in OS X, and indicating that they would be supplying code and resources to the OpenJDK project.

As I've noted before, this makes sense for Apple, because it gets them out of the JVM business, and makes Oracle the one-stop shopping solution for all your JDK and JRE needs. It also means that the Mac can be added as a regression-tested target for a new version of Java, hopefully avoiding the kind of Java versioning snafus that rendered IBM's SPSS (or is it PASW this week?) statistics package broken for the last month or so.

Apple makes nice with Java and lets the Google Voice app hit the iPhone in the same week. Could peace in the Middle East be next?

Editorial: NPEs must die

Staying on a Java theme, probably the most common error messages I see on websites (after the all-too-common IIS MSQL database errors) are Java Null Pointer Exceptions (NPEs). If you're not a Java person, an NPE is what you get when you try to call a method on an object, but the variable holding the object contains null rather than a real object.

NPEs are to Java what bad pointer references are to C. The difference is that, unlike C, an NPE in Java is usually a non-fatal event. In a web server container like Tomcat, an uncaught NPE ends up as an HTTP status code 500 and backtrace spam.

NPEs are a direct result of not checking the validity of arguments and data before operating on it. Too many Java programmers are lazy, and assume because an NPE doesn't crash things, they can just let it trickle up the stack until someone catches it.

This is just plain bad coding. An NPE tells you almost nothing about what really went wrong. Instead, Java developers should check objects before invoking methods on them, and throw a more specific and meaningful runtime exception, such as an IllegalArgumentException, with details about what value was incorrect.

The same holds true for other languages, of course. It's always a best practice to verify values external to the current context. Java developers just seem to be especially bad about it. So remember folks, trust but verify!
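The same fail-fast pattern translates directly to other languages. Here's a hedged Python sketch — the `ship_order` function and its dict-based order are hypothetical — where the analog of an NPE is an AttributeError or TypeError on None:

```python
# Python analog of the advice above: instead of letting a None trickle up
# and blow up deep in the stack, validate arguments at the boundary and
# raise a specific, meaningful error that names the bad value.
def ship_order(order):
    if order is None:
        raise ValueError("ship_order: order must not be None")
    if order.get("address") is None:
        raise ValueError(f"order {order.get('id')!r} has no shipping address")
    return f"shipping order {order['id']} to {order['address']}"

print(ship_order({"id": 7, "address": "10 Main St"}))
# prints: shipping order 7 to 10 Main St

# ship_order(None) now fails with a message saying what was wrong,
# not just where the null was dereferenced.
```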

And speaking of MSQL

This must be my week for segues, because speaking of MSQL, Microsoft released a CTP version of SQL Server 2011 (Denali, they call it). CTP is Community Technology Preview, which is kinda Microsoftease for "beta."

Love it or hate it, MSQL powers a lot of the web, and a new release of SQL Server is a big deal for that part of the world that runs on 96-octane .NET fuel. Database technology is pretty much a two-horse game these days, now that Oracle owns the former third horse (MySQL). A new SQL Server will hopefully raise the bar enough to get some new and interesting things out of Oracle's two database projects.

Another day on the JavaScript racetrack

Probably the most commonly programmed platform these days is JavaScript on the browser. As a result, browser vendors take their JavaScript performance numbers very seriously. For a while, Chrome has been the target to beat. Ok, technically, Opera has been posting better numbers lately, but Opera has never managed to build significant market share. Let's just say that Chrome has had the best numbers among the big 4 (IE, Firefox, Chrome, Safari).

Mozilla evidently decided not to take things lying down. The latest version of the Firefox 4 beta has turned in speed numbers the Road Runner would find respectable. How long will it take Google to respond? Or, will the new IE beta come from behind to snatch the crown of JavaScript performance? Personally, I'd prefer that Google fix Chrome so Flash doesn't break on the Mac if you look at it funny ...

That's it for this week. Suggestions are always welcome, so please send tips or news here.

September 16 2010

Four short links: 16 September 2010

  1. jsTerm -- ANSI-capable telnet terminal built in HTML5 with JavaScript, WebSockets, and Node.js. (via waxpancake on Twitter)
  2. MySQL EXPLAINer -- visualize the output of the MySQL EXPLAIN command. (via eonarts on Twitter)
  3. Google Code University -- updated with new classes, including C++ and Android app development.
  4. Cloudtop Applications (Anil Dash) -- Anil calling "trend" on multiplatform native apps with cloud storage. Another layer in the Web 2.0 story Tim's been telling for years, with some interesting observations from Anil, such as: Cloudtop apps seem to use completely proprietary APIs, and nobody seems overly troubled by the fact they have purpose-built interfaces.

March 23 2010

Joe Stump on data, APIs, and why location is up for grabs

I recently had a long conversation with Joe Stump, CTO of SimpleGeo, about location, geodata, and the NoSQL movement. Stump, who was formerly lead architect at Digg, had a lot to say. Highlights are posted below. You can find a transcript of the full interview here.

Competition in the geodata industry:

I personally haven't seen anybody that has come out and said, "We're actively indexing millions of points of data. We're also offering storage and we're giving tools to leverage that." I've seen a lot of fragmentation. Where SimpleGeo fits is, I really think, at the crossroads or the nexus of a lot of people that are trying to figure out this space. So ESRI is a perfect example. They have a lot of data. Their stack is enormous. They answer everything from logistics down to POI things, but they haven't figured out the whole cloud, web, infrastructure, turn-key approach. They definitely haven't had to worry about real time. How do you index every single tweet and every single Twitter photo without blowing up? With the data providers, there's been a couple of people that are coming out with APIs and stuff.

I think largely, things are up for grabs. I think one of the issues that I see is as people come out with their location APIs here, like NAVTEQ is coming out with an API, as a developer, in order to do location queries and whatnot, especially on the mobile device, I don't want to have to do five different queries, right? Those round trips could add up a lot when you're on high latency slow networks. So while I think that there's a lot of people that are orbiting the space and I know that there's a lot of people that are going to be coming out with location and geo offerings, a lot of people are still figuring out and trying to work their way into how they're going to use location and how they're going to expose it.

How SimpleGeo stores location data:


The way that we've gone about doing that is we actually have two clusters of databases. We've essentially taken Cassandra, the non-relational NoSQL store that Facebook made, and we've built geospatial features into it. The way that we do that is we actually have an index cluster and then we have a records cluster. The records cluster is very mundane; it's used for getting the actual data. The index cluster, however, we have tackled that in a number of different ways. But the general idea is that those difficult computations that you're talking about are done on write rather than read. So we precompute a number of scenarios and add those to different indexes. And then based on the query, use the most appropriate precomputed index to answer those questions.

We do that in a number of different ways. We do that through very careful key construction based on a number of different algorithms that are developed publicly and a couple that we developed internally. One of the big problems with partitioning location data is that location data has natural density problems. If you partition based on UTM or zip code or country or whatever, servers that are handling New York City are going to be overloaded. And servers that are handling South Dakota are going to be bored out of their minds. We basically have answered that in a couple of different layers. And then we also have some in-process stuff that we do before serving up the request.
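The "careful key construction" Stump mentions can be illustrated with the best-known public algorithm in that family: bit-interleaved, geohash-style keys. This is an assumption-laden toy, not SimpleGeo's actual index code — the point is only that nearby points end up sharing a key prefix, so proximity queries become prefix scans:

```python
# Geohash-style key construction: interleave longitude and latitude bits
# so that nearby points share a long key prefix. An index sorted on these
# keys can answer "what's near me?" with a range/prefix scan.
def geokey(lat, lon, bits=16):
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    key = 0
    for i in range(bits):
        if i % 2 == 0:  # even bit positions refine longitude
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:           # odd bit positions refine latitude
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        key = (key << 1) | bit
    return format(key, f"0{bits}b")

# Nearby points share a long prefix; distant ones diverge early.
sf = geokey(37.77, -122.42)
oakland = geokey(37.80, -122.27)
nyc = geokey(40.71, -74.01)
```

The density problem described above is exactly why the real system needs more than this one trick: a pure spatial key sends all of New York City to one hot shard, so the key usually gets salted or combined with other partitioning layers.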

The virtues of keeping your data API simple:

We started out with what we considered to be the most basic widely-needed use case for developers, which is simply "my user is here; tell me about points of data that are within a certain radius of where my user's sitting." And we've been slowly growing indexes from there.

Our most basic index is that my user is at this lat/long, tell me about any point that I've told you about previously, within one kilometer. That's a very basic query. We do some precomputing on the write to help with those types of queries. The next level up is that my user was here a week ago and would like to see stuff that was produced or put into the system a week ago. So we've, again, done some precomputing on the indexes to do temporal spatial queries.

A lot of users, once they see a storage thing in the cloud, they inevitably want to be able to do very open-ended abstract queries like "show me all pieces of data that are in this zip code where height is this and age is greater than this, ordered by some other thing." We push back on that. Basically, we're not a generic database in the cloud. We're there specifically to help you manage your geodata.

Why NoSQL is gaining in popularity:

I think that this is in direct response to the relational database tool chain failing, and failing catastrophically at performing in large scale, real-time environments. The simple fact is that creating a relational database that's spread across 50 servers is not an easy thing to build or manage. There's always some sort of app logic you have to build on top of it. And you have to manage the data and whatnot.

If you go into almost any high performance shop that's working at a pretty decent scale, you'll find that they avoid doing foreign key constraints because they'll slow down write. They're partitioning their data across multiple servers, which drastically increases their overhead and managing data. So I think really why you're seeing an uptake in NoSQL is because people have tried that tool chain; it's not evolving. It basically falls over in real-time environments.

The role of social networking in the demise of SQL:

I bet if you track the popularity of social, you'll almost identically track the popularity of NoSQL. Web 1.0 was extremely easy to scale because there was a finite amount of content, highly cacheable, highly static. There was just this rush to put all of the brick and mortar stuff online. So scaling, for instance, an online catalog is not that difficult. You've got maybe 50,000 products. Put it in memcached. MySQL can probably handle those queries. You're doing maybe an order every second or something. It's high value and not a lot of volume.

And then, of course, during Web 2.0, we had this bright idea to hand content creation over to the masses. If you draw a diagram and one circle is users and one circle is images, scaling out a whole bunch of users and scaling out a whole bunch of pictures is not too difficult, because you run into the same thing where I need to do a primary key look-up and then I need to cache in memcached because people don't edit their user's data that often. And once they upload a photo, they pretty much never change that.

The problem comes when you intersect those social objects. The intersection in that Venn diagram, which is a join in SQL, falls over pretty quickly. You don't need a very big dataset. Even for a fairly small website, MySQL tends to fall over pretty quickly on those big joins. And most of the NoSQL stuff that people are using, they've found that if you're just doing a primary key look up 99 percent of the time on some table, you don't need to use MySQL and worry about managing that, when you can stuff it into Cassandra or some other NoSQL DB because it's basically just a key value store.

How NoSQL simplifies data administration:

Essentially, there are a lot of people out there that are "using MySQL," but they're using it in a very, very NoSQL manner. Like at Digg, for instance, joins were verboten, no foreign key constraints, primary key look-ups. If you had to do ranges, keep them highly optimized and basically do the joins in memory. And it was really amazing. For instance, we rewrote comments about a year-and-a-half ago, and we switched from doing the sorting on a MySQL front to doing it in PHP. We saw a 4,000 percent increase in performance on that operation.

There's just so many things where you have to basically use MySQL as a NoSQL store. What people have found is, rather than manage these cumbersome MySQL clusters, with NoSQL they get those basics down. With MySQL, if I wanted to spread data across five servers, I'd have to create an application layer that knows that every user that starts with "J" goes to this server; every user that starts with "V" goes over here. Whereas with Cassandra, I just give it token ranges and it manages all of that stuff internally. And with MySQL, I'd have to handle something like, "Oh, crap. What if that server goes down?" Well, now I'm going to have three copies across those five servers. So I build more application logic. And then I build application logic to migrate things. Like what happens if I have three Facebook users that use up a disproportionate amount of resources? Well, I've got to move this whale off this MySQL server because the load's really high. And if I moved him off of it, it would be nothing. And then you have to make management tools for migrating data and whatnot.

Cassandra has all of that stuff built in. All of the annoying things that we had at the beginning, NoSQL has gotten those things really, really well. Coming from having been in a shop that has partitioned MySQL, Cassandra is a Godsend.
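The token-range routing Stump credits to Cassandra can be sketched as a tiny hash ring. The node names and MD5-based token function here are illustrative, not Cassandra's actual partitioner; the point is that routing lives in one place instead of in hand-written "J goes here, V goes there" application logic:

```python
import hashlib

# Minimal sketch of token-range routing: hash each key onto a ring,
# give every node a token, and send a key to the first node whose token
# is >= the key's position, wrapping around at the top of the ring.
def token(key, space=2**32):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % space

class Ring:
    def __init__(self, nodes):
        # node name -> token; real systems assign these ranges deliberately
        self.tokens = sorted((token(n), n) for n in nodes)

    def route(self, key):
        t = token(key)
        for node_token, node in self.tokens:
            if t <= node_token:
                return node
        return self.tokens[0][1]  # wrap around the ring

ring = Ring(["node-a", "node-b", "node-c"])
placement = {user: ring.route(user) for user in ["joe", "vera", "maria"]}
print(placement)
```

Rebalancing a hot node then means moving a token boundary, not rewriting application code — which is the "built in" management Stump is praising.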

December 10 2009

Four short links: 10 December 2009

  1. Scriblio -- open source CMS and catalogue built on WordPress, with faceted search and browse. (via titine on Delicious)
  2. Useful Temporal Functions and Queries -- SQL tricksies for those working with timeseries data. (via mbiddulph on Delicious)
  3. Optimal Starting Prices for Negotiations and Auctions -- Mind Hacks discussion of a research paper on whether high or low initial prices lead to higher price outcomes in negotiations and online auctions. Many negotiation books recommend waiting for the other side to offer first. However, existing empirical research contradicts this conventional wisdom: The final outcome in single and multi-issue negotiations, both in the United States and Thailand, often depends on whether the buyer or the seller makes the first offer. Indeed, the final price tends to be higher when a seller (who wants a higher price and thus sets a high first offer) makes the first offer than when the buyer (who offers a low first offer to achieve a low final price) goes first.
  4. WiFi Science History -- Australian scientist studies black holes in the 70s, has to develop a way of piecing together signals that have been distorted as they travel through space. Realizes, when he starts playing with networked computers in the late 80s, that this same technique would let you "cut the wires". A decade later it emerged as a critical part of wireless networking. As Aaron Small says, it shows the value of basic research, where you don't have immediate applications in mind and can't show short-term deliverables or an application to a current high-value problem.

November 11 2009

Counting Unique Users in Real-time with Streaming Databases

As the web increasingly becomes real-time, marketers and publishers need analytic tools that can produce real-time reports. As an example, the basic task of calculating the number of unique users is typically done in batch mode (e.g. daily) and in many cases using a random sample from the relevant log files. If unique user counts can be accurately computed in real-time, publishers and marketers can mount A/B tests or referral analysis to dynamically adjust their campaigns.

In a previous post I described SQL databases designed to handle data streams. In their latest release, Truviso announced technology that allows companies to track unique users in real-time. Truviso uses the same basic idea I described in my earlier post:

Recognizing that "data is moving until it gets stored", the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data.

Truviso uses (compressed) bitmaps and set theory to compute the number of unique customers in real-time. In the process they are able to handle the standard SQL queries associated with these types of problems: counting the number of distinct users, for any given set of demographic filters. Bitmaps are built as data streams into the system, using the same underlying technology that allows Truviso to handle massive data sets from high-traffic web sites.
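The bitmap-and-set-theory approach can be sketched with plain Python integers as bitmaps. The segment names and events below are invented, and Truviso's compressed bitmaps are far more elaborate, but the core trick is the same: one bit per user, one bitmap per filter value, and intersections answer the distinct-count queries:

```python
# Sketch of bitmap-based distinct counting: assign each user a bit
# position on first sight, OR that bit into per-segment bitmaps as events
# stream in, then answer "distinct users matching these filters" with
# bitwise AND plus a popcount.
user_ids = {}  # user -> bit position, assigned on first sight

def bit_for(user):
    return user_ids.setdefault(user, len(user_ids))

segments = {}  # e.g. ("country", "US") -> integer used as a bitmap

def record(user, **attrs):
    bit = 1 << bit_for(user)
    for k, v in attrs.items():
        segments[(k, v)] = segments.get((k, v), 0) | bit

def distinct(*filters):
    # intersect the chosen segment bitmaps; popcount = unique-user count
    bm = segments.get(filters[0], 0)
    for f in filters[1:]:
        bm &= segments.get(f, 0)
    return bin(bm).count("1")

record("ann", country="US", device="ios")
record("bob", country="US", device="android")
record("ann", country="US", device="ios")   # repeat visit; still one user
record("eve", country="DE", device="ios")

print(distinct(("country", "US")))                     # 2
print(distinct(("country", "US"), ("device", "ios")))  # 1
```

Because repeat events just re-set an already-set bit, duplicates cost nothing, and each additional demographic filter is one more AND rather than another pass over the logs.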


Once companies can do simple counts and averages in real-time, the next step is to use real-time information for predictive analytics. Truviso has customers using their system for "on-the-fly predictive modeling".

The other major enhancement in this release is a major step towards parallel processing. Truviso's new execution engine processes runs or blocks of data in parallel in multi-core systems or multi-node environments. Using Truviso's parallel execution engine is straightforward on a single multi-core server, but on a multi-node cluster it may require considerable attention to configuration.

[For my previous posts on real-time analytic tools see here and here.]
