December 16 2013

Four short links: 17 December 2013

  1. WebGraph -- a framework for graph compression aimed at studying web graphs. It provides simple ways to manage very large graphs, exploiting modern compression techniques. (via Ben Lorica)
  2. Learn to Program with Minecraft Plugins -- You’ll need to add features to the game itself: learn how to build plugins for your own Minecraft server using the Java programming language. You don’t need to know anything about programming to get started -- this book will teach you everything you need to know! Shameless Christmas stocking bait! (via Greg Borenstein)
  3. In Search of Perfection, Young Adults Turn to Adderall at Work (Al Jazeera) — “Adderall is just the tip of the iceberg,” Essig said. “There are lots more drugs coming down the pike. The way we set up our cultural model for dealing with psychologically performance-enhancing drugs is a real serious question.”
  4. Explain Shell — uses parsed manpages to explain a shell commandline. (via Tracy K Teal)

February 17 2012

The stories behind a few O'Reilly "classics"

This post originally appeared in Tim O'Reilly's Google+ feed.

It's amazing to me how books I first published more than 20 years ago are still creating value for readers. O'Reilly Media is running an ebook sale for some of our "classics."

"Vi and Vim" is an updated edition of a book we first published in 1986! Linda Lamb was the original author; I was the editor, and added quite a bit of material of my own. (In those days, being the "editor" for us really meant being ghostwriter and closet co-author.) I still use and love vi/vim.

"DNS and Bind" has an interesting back story too. In the late '80s or early '90s, I was looking for an author for a book on smail, a new competitor to sendmail that seemed to me to have some promise. I found Cricket Liu, and he said, "what I really want to write a book about is Bind and the Domain Name System. Trust me, it's more important than smail." The Internet was just exploding beyond its academic roots (we were still using UUCP!), but I did trust him. We published the first edition in 1992, and it's been a bestseller ever since.

"Unix in a Nutshell" was arguably our very first book. I created the first edition in 1984 for a long-defunct workstation company called Masscomp; we then licensed it to other companies, adapting it for their variants of Unix. In 1986, we published a public edition in two versions: System V and BSD. The original editions were inspired by the huge man page documentation sets that vendors were shipping at the time: I wanted to have something handy to look up command-line options, shell syntax, regular expression syntax, sed and awk command syntax, and even things like the ascii character set.

The books were moderately successful until I tried a price drop from the original $19.50 to $9.95 as an experiment, with the marketing headline "Man bites dog." I told people we'd try the new price for six months, and if it doubled sales, we'd keep it. Instead, the enormous value proposition increased sales literally by an order of magnitude. At the book's peak, we were selling tens of thousands of copies a month.

Every other "in a nutshell" book we published derived from this one, a product line that collectively sold millions of copies, and helped put O'Reilly on the map.

"Essential System Administration" is another book that dates back to our early days as a documentation consulting company. I wrote the first edition of this book for Masscomp in 1984; it might well be the first Unix system administration book ever written. I had just written a graphics programming manual for Masscomp, and was looking for another project. I said, "When any of us have any problems with our machines, we go to Tom Texeira. Where are our customers going to go?" So I interviewed Tom, and wrote down what he knew. (That was the origin of so many of our early books — and the origin of the notion of "capturing the knowledge of innovators.")

I acquired the rights back from Masscomp, and licensed the book to a company called Multiflow, where Mike Loukides ran the documentation department. Mike updated the book. Æleen Frisch, who was working for Mike, did yet another edition for Multiflow, and when the company went belly up, I acquired back the improved version (and hired Mike as our first editor besides me and Dale). He signed Æleen to develop it as a much more comprehensive book, which has been in print ever since.

"Sed and Awk" has a funny backstory too. It was one of the titles that inspired the original animal designs. Edie Freedman thought Unix program names sounded like weird animals, and this was one of the titles she chose to make a cover for, even though the book didn't exist yet. We'd hear for years that people knew it existed — they'd seen it. Dale Dougherty eventually sat down and wrote it, mostly because he loved awk but also just to satisfy those customers who just knew it existed.

(Here's a brief history of how Edie came up with the idea for the animal book covers.)

And then there's "Unix Power Tools." In the late '80s, Dale had discovered hypertext via Hypercard, and when he discovered Viola and the World Wide Web, that became his focus. We had written a book called "Unix Text Processing" together, and I was hoping to lure him back to writing another book that exercised the hypertext style of the web, but in print. Dale was working on GNN by that time and couldn't be lured onto the project, but I was having so much fun that I kept going.

I recruited Jerry Peek and Mike Loukides to the project. It was a remarkable book both in being crowdsourced — we collected material from existing O'Reilly books, from saved Usenet posts, and from tips submitted by customers — and in being cross-linked like the web. Jerry built some great tools that allowed us to assign each article a unique ID, which we could cross-reference by ID in the text. As I rearranged the outline, the cross-references would automatically be updated. (It was all done with shell scripts, sed, and awk.)

Lots more in this trip down memory lane. But the fact is we've kept the books alive, kept updating them, and they are still selling, and still helping people do their jobs, decades later. It's something that makes me proud.

See comments and join the conversation about this topic at Google+.

October 30 2011

On Dennis Ritchie: A conversation with Brian Kernighan

The phrase "Kernighan and Ritchie" has entered computing jargon independently of the lexical tokens from which it is constituted. I talked on Friday with Brian Kernighan about Dennis Ritchie, who sadly passed away two weeks ago at the age of 70. Brian had gotten to know me a bit when he contributed a chapter on regular expressions to the O'Reilly book Beautiful Code. He said, "It's just remarkable how much we all still depend on things Dennis created." Kernighan went on to say that Ritchie was not self-promotional, but just quietly went about doing the work he saw needed to be done.

Plenty of ink has been devoted over the past 40 years to the impacts of the C language and of the Unix operating system, both of which sprang to a great extent from Ritchie's work, and more ink on the key principles they raised that have enabled key advances in computing: portability, encapsulation, many small programs cooperating through pipelining, a preference for representing data in text format, and so on. So I did not concentrate on these hoary, familiar insights, but talked to Kernighan about a few other aspects of Ritchie's work.

More than bits of C remain

It's noteworthy that Android added C support through the Native Development Kit. They needed the support in order to offer 3D graphics through the OpenGL libraries, which were written in C, but the NDK has proven very popular with developers who embellish their applications with device-specific code that has to be written in C.

Apple's Cocoa and iOS have an easier time with C support, because their Objective-C code can directly call C functions. The "C family" that includes Objective-C and C++ is obviously thriving, but plain old C is still a key language too. Kernighan freely admits, "A lot of things are done better in other languages," but C is still queen for its "efficiency and expressiveness" at low-level computing.

Ritchie created C with hardware in mind, and with the goal of making access to this hardware as efficient as possible while preserving the readability and portability that assembly language lacks. So it's not surprising that C is still used for embedded systems and other code that refers to hardware ports or specific memory locations. But C has also proven a darn good fit for any code requiring bit-level operations. Finally, it is still first choice for blazingly fast applications, which is why every modern scripting language has some kind of C gateway and why programmers still "drop down to C" for some operations. And of course, if you use a function such as printf or cos from your favorite scripting language, you are most likely standing on the shoulders of the people who wrote those C library functions.

Kernighan told me that Ritchie was always conscious of the benefits of using minimal resources. Much of what he did was a reaction against the systems he was working with when creating Unix with Ken Thompson (a reaction to the Multics operating system) and C (a reaction to the PL/1 language that Multics was based on). EPL, Bliss, and especially BCPL were more positive influences on C. The things that made C successful not only inspired languages and systems that followed, but kept C a serious contender even after they were developed.

After Unix, Ritchie worked a great deal on Rob Pike's Plan 9 system. Not only did Ritchie use it exclusively, but he contributed to the code and helped manage the project at Bell Labs.


The computing environment of the 1970s was unbelievably limited by modern standards. Kernighan said that Ritchie and Thompson had to design Unix to run in 24,000 bytes. And similar constraints existed in everything: disk space, bandwidth (when networks were invented!), and even I/O devices. The infamous terseness of Unix commands and output was predicated on their use at printers, and helped to save paper.

But that was OK back then. You could make do with a fourteen-character limit on filenames, because programs were divided into relatively few files and you might have to manage only a few hundred files in your own work. You'd never need to run more than 32,768 processes or listen on more than 1024 incoming connections on your system. The Unix definition of time soon proved to be more problematic, both because some Unix programs may last beyond 2038 and because a granularity of less than a second is often important.
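
The numbers above are powers of two in disguise. As a rough sketch, assuming process IDs are held in a signed 16-bit value and time in a signed 32-bit count of seconds since the start of 1970 (the conventional explanation for both limits, not something Kernighan spelled out in our conversation), shell arithmetic reproduces them:

```shell
# Assumption: process IDs fit in a signed 16-bit value, and time is a
# signed 32-bit count of seconds since 1970-01-01 00:00:00 UTC.
max_pid_count=$(( 1 << 15 ))
max_time=$(( (1 << 31) - 1 ))

# Average Gregorian year in seconds (365.2425 days * 86400).
years=$(( max_time / 31556952 ))

echo "16-bit limit: $max_pid_count"
echo "last 32-bit second: $max_time (year $(( 1970 + years )))"
```

The second line of output lands in 2038, which is where the deadline mentioned above comes from.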

But Kernighan points out that keeping things within the limits of a word (16 bits on the PDP-11 that Ritchie wrote for) had additional benefits by reducing complexity and hence the chance for error. If you want to accommodate huge numbers of files within a directory, you have to create a complex addressing system for files. And because the complex system is costly, you need to keep the old, simple system with it and create a series of levels that can be climbed as the user adds more files to the directory. All this makes it harder to program a system that is bug-free, not to mention lean and fast.

The joys of pure computer science

Kernighan's and Ritchie's seminal work came in the 1970s, when Kernighan said the computer field still presented "low-hanging fruit." Partly because they and their cohorts did so well, we don't have much more to do in the basic computer development that underlies the ever-changing cornucopia of protocols and applications that are the current focus of programmers. New algorithms will continue to develop, particularly thanks to the growth of multiprocessing and especially heterogeneous processors. New operating system constructs will be needed in those environments too. But most of the field has moved away from basic computer science toward various applications that deal directly with real-world activities.

We know there are more Dennis Ritchies out there, but they won't be working heavily in the areas that Ritchie worked in. To a large extent, he completed what he started, and inspired many others of his own generation.

October 29 2011

Dennis Ritchie's legacy of elegantly useful tools

On Sunday, 10/30 we're celebrating Dennis Ritchie Day. Help spread the word: #DennisRitchieDay

Shortly after Dennis Ritchie died, J.D. Long (@cmastication) tweeted perhaps the perfect comment on Ritchie's life: "Dennis Ritchie was the architect whose chapel ceiling Steve Jobs painted." There aren't many who remember the simplicity and the elegance of the Unix system that Jobs made beautiful, and even fewer who remember the complexity and sheer awfulness of what went before: the world of IBM's S/360 mainframes, JCL, and DEC's RSX-11.

Much of what was important about the history of Unix is still in OS X, but under the surface. It would have been almost inconceivable for Apple to switch from the PowerPC architecture to the Intel architecture if Unix wasn't written in C (and its successors), and wasn't designed to be portable to multiple hardware platforms. Unix was the first operating system designed for portability. Portability isn't noticeable to the consumer, but it was crucial to Apple's long-term strategy.

OS X applications have become all-consuming things: you can easily pop from email to iTunes to Preview and back again. It's easy to forget one key component of the original Unix philosophy: simple tools that did one thing, did it well, and could be connected to each other with pipes (Doug McIlroy's invention). But simple tools that date back to the earliest days of Unix still live on, and are still elegantly useful.

Dennis Ritchie once said "UNIX is basically a simple operating system, but you have to be a genius to understand the simplicity." It's true. And we need more geniuses who share his spirit.


October 26 2011

Dennis Ritchie Day

Sunday, October 16 was declared Steve Jobs Day by California's Governor Brown. I admire Brown for taking a step to recognize Jobs' extraordinary contributions, but I couldn't help but be struck by Rob Pike's comments on the death of Dennis Ritchie a few weeks after Steve Jobs. Pike wrote:

I was warmly surprised to see how many people responded to my Google+ post about Dennis Ritchie's untimely passing. His influence on the technical community was vast, and it's gratifying to see it recognized. When Steve Jobs died there was a wide lament — and well-deserved it was — but it's worth noting that the resurgence of Apple depended a great deal on Dennis' work with C and Unix.

The C programming language is quite old now, but still active and still very much in use. The Unix and Linux (and Mac OS X and I think even Windows) kernels are all C programs. The web browsers and major web servers are all in C or C++, and almost all of the rest of the Internet ecosystem is in C or a C-derived language (C++, Java), or a language whose implementation is in C or a C-derived language (Python, Ruby, etc.). C is also a common implementation language for network firmware. And on and on.

And that's just C.

Dennis was also half of the team that created Unix (the other half being Ken Thompson), which in some form or other (I include Linux) runs all the machines at Google's data centers and probably at most other server farms. Most web servers run above Unix kernels; most non-Microsoft web browsers run above Unix kernels in some form, even in many phones.

And speaking of phones, the software that runs the phone network is largely written in C.

But wait, there's more.

In the late 1970s, Dennis joined with Steve Johnson to port Unix to the Interdata. From this remove it's hard to see how radical the idea of a portable operating system was; back then OSes were mostly written in assembly language and were tightly coupled, both technically and by marketing, to specific computer brands. Unix, in the unusual (although not unique) position of being written in a "high-level language," could be made to run on a machine other than the PDP-11. Dennis and Steve seized the opportunity, and by the early 1980s, Unix had been ported by the not-yet-so-called open source community to essentially every mini-computer out there. That meant that if I wrote my program in C, it could run on almost every mini-computer out there. All of a sudden, the coupling between hardware and operating system was broken. Unix was the great equalizer, the driving force of the Nerd Spring that liberated programming from the grip of hardware manufacturers.

The hardware didn't matter any more, since it all ran Unix. And since it didn't matter, hardware fought with other hardware for dominance; the software was a given. Windows obviously played a role in the rise of the x86, but the Unix folks just capitalized on that. Cheap hardware meant cheap Unix installations; we all won. All that network development that started in the mid-80s happened on Unix, because that was the environment where the stuff that really mattered was done. If Unix hadn't been ported to the Interdata, the Internet, if it even existed, would be a very different place today.

I read in an obituary of Steve Jobs that Tim Berners-Lee did the first WWW development on a NeXT box, created by Jobs' company at the time. Well, you know what operating system ran on NeXT's, and what language.

For myself, I can attest that there would be no O'Reilly Media without Ritchie's work. It was Unix that created the fertile ground for our early publishing activities; it was Unix's culture of collaborative development and architecture of participation that was the deepest tap root of what became the open source software movement, and not coincidentally, much of the architecture of the Internet as well. These are the technologies I built my business around. Anyone who has built their software or business with knowledge from O'Reilly books or conferences can trace their heritage back to Ritchie and his compatriots.

I don't have the convening power of a Governor Brown, but for those of us around the world who care, I hereby declare this Sunday, October 30 to be Dennis Ritchie Day! Let's remember the contributions of this computing pioneer.

P.S. Help spread the word. Use the hashtag #DennisRitchieDay on Twitter and Google+

Photo: Via Wikimedia Commons.

October 13 2011

Developer Week in Review: Two giants fall

My apologies for the lack of a Week in Review last week — I was taken by the seasonal plague that's going around the Northeast, and spent most of the last week in a NyQuil haze. Fun bonus fact: Did you know certain prescription drugs inhibit the function of the CYP2D6 enzyme, which means that you can't metabolize Dextromethorphan (aka Robitussin)?

Thankfully, I was able to pull myself up from my sickbed and get my order in for one of those newfangled iPhone 4S contraptions. It's currently sitting at the UPS sorting facility in Kentucky. The faster processor and Siri are nice, but for me the big attraction is the 64GB of storage. I was always bumping up against my current 32GB iPhone 4's disk limit.

On to the Review ...

So long Steve, and thanks for all the apps

At this point, pretty much anything I could say about the passing of Steve Jobs has been said so many times already that it would be irrelevant. I was fortunate to see him in person once, at the last WWDC, but like many people, I've followed his career for years. I have a somewhat unique perspective because I worked at Xerox AI Systems in the mid '80s, selling the Xerox Star (and later Dandelion) with Interlisp, and got to use the Xerox Alto at the MIT AI lab before that. In other words, I was able to use what pretty much became the Mac before the Mac existed.

It was a tremendous source of frustration to those of us who worked at Xerox that the company seemed to have no clue what an incredible breakthrough the Alto and its successors were. Obviously, Jobs had significant amounts of "clueness" because he raided the mouse and GUI wholesale from PARC, and a good thing he did, or we'd still be using CP/M.

One important legacy of Jobs is the App Store model. If you owned a Windows Mobile or Palm device at the turn of this century, you know what a mess it was to get applications to run on them. Until the App Store came along, you either had to hunt around the web for interesting things to run on your smartphone, or you were at the mercy of what your carrier chose to allow. The App Store created both a distribution model and an even playing field for independent and large software makers alike.

Goodbye to Dennis Ritchie

The other significant passing we have to mark this week is Dennis Ritchie, father of C and one of the brains behind Unix. It's no exaggeration to say that if you had walked into any programmer's office in the early '80s, you would have probably found a copy of "The C Programming Language" on the bookshelf. Between C (which begat the majority of the modern languages we use today) and Unix (ancestor of Linux, BSD, Solaris, OS X, iOS, and countless other POSIX spin-offs), Ritchie has likely influenced the computer field more than any other single individual in the last 50 years, Donald Knuth included.

Ritchie was a veteran of Bell Labs, the organization we have to thank for fostering the innovative environment that let him be so creative. I'd be hard pressed to find an organization today that is offering that kind of fertile soil, out of which so many beautiful flowers bloomed. Jobs may have been the flashier showman, but he never would have gotten off the ground without the contributions Ritchie made.

Worst reply-all ever?

We got a rare view into the inner workings of Google this week, thanks to an inadvertent broadcasting of a long rant by long-time Google employee Steve Yegge. Yegge accidentally made his short-story-length critique of Google's API policies public on Google+, letting the world know how he felt.

While it will be interesting to see if Yegge's posting turns out to be a career-limiting move, what's more interesting is the insight it gives us into the problems Google is facing internally. Yegge's main complaint is that Google doesn't eat its own dog food when it comes to APIs. He particularly singles out Google+ as an example of a product with almost no useful APIs, and charges Google with developing products rather than platforms.

Those of us who have been frustrated with Google's inability to implement "simple" things like a consistent single sign-on infrastructure would tend to agree.


June 21 2011

Four short links: 21 June 2011

  1. tmux -- GNU Screen-alike, with vertical splits and other goodies. (via Hacker News)
  2. Gamifying Education (Escapist) -- a more thoughtful and reasoned approach than crude badgification, but I'd still feel happier meddling with kids' minds if there was research to show efficacy and distribution of results. (via Ed Yong)
  3. Rule of 72 (Terry Jones) -- common piece of financial mental math, but useful outside finance when you're calculating any kind of exponential growth (e.g., bad algorithms). (via Tim O'Reilly)
  4. Spam Hits the Kindle Bookstore (Reuters) -- create a system of incentives and it will be gamed, whether it's tax law, search engines, or ebook stores. Aspiring spammers can even buy a DVD box set called Autopilot Kindle Cash that claims to teach people how to publish 10 to 20 new Kindle books a day without writing a word. (via Clive Thompson)

May 10 2011

Four short links: 10 May 2011

  1. ODB to iPhone Converter -- hardware to connect to your car's onboard computer and display it on an iPhone app. (via Imran Ali)
  2. Multitasking Brains (Wired) -- interesting pair of studies: old brains have trouble recovering from distractions; hardcore multitaskers have trouble focusing. (via Stormy Peters)
  3. Social Privacy -- Danah Boyd draft paper on teens' attitudes to online privacy. Interesting take on privacy as about power: This incident does not reveal that teens don't understand privacy, but rather that they lack the agency to assert social norms and expect that others will respect them. (via Maha Shaikh)
  4. Cool but Obscure Unix Tools -- there were some new tricks for this old dog (iftop, socat). (via Andy Baio)

April 20 2011

Four short links: 20 April 2011

  1. PDP-11 Emulator in Javascript, Running V6 UNIX -- blast from the past, and quite a readable emulator (heads up: cd was chdir back then). See also the 1st edition UNIX source on github. (via Hacker News)
  2. 2010: The Year of Crowdsourcing Transcription -- hasn't finished yet, as NY Public Library shows. Cultural institutions are huge data sets that need human sensors to process, so we'll be seeing a lot more of this in years to come as we light up thousands of years of written culture. (via Liza Daley)
  3. Programming the Commodore 64 -- the loss of the total control that we had over our computers back when they were small enough that everything you needed to know would fit inside your head. It’s left me with a taste for grokking systems deeply and intimately, and that tendency is probably not a good fit for most modern programming, where you really don’t have time to go in and learn, say, Hibernate or Rails in detail: you just have to have the knack of skimming through a tutorial or two and picking up enough to get the current job done, more or less. I don’t mean to denigrate that: it’s an important and valuable skill. But it’s not one that moves my soul as Deep Knowing does. This is the kind of deep knowledge of TCP/IP and OS that devops is all about.
  4. Kids do Science -- scientists let kids invent an experiment, write it up, and it's published in Biology Letters. Teaching the method of science, not the facts currently in vogue, will give us a generation capable of making data-based decisions.

April 07 2011

Data hand tools

The flowering of data science has both driven, and been driven by, an explosion of powerful tools. R provides a great platform for doing statistical analysis, Hadoop provides a framework for orchestrating large clusters to solve problems in parallel, and many NoSQL databases exist for storing huge amounts of unstructured data. The heavy machinery for serious number crunching includes perennials such as Mathematica, Matlab, and Octave, most of which have been extended for use with large clusters and other big iron.

But these tools haven't negated the value of much simpler tools; in fact, they're an essential part of a data scientist's toolkit. Hilary Mason and Chris Wiggins wrote that "Sed, awk, grep are enough for most small tasks," and there's a layer of tools below sed, awk, and grep that are equally useful. Hilary has pointed out the value of exploring data sets with simple tools before proceeding to a more in-depth analysis. The advent of cloud computing, Amazon's EC2 in particular, also places a premium on fluency with simple command-line tools. In conversation, Mike Driscoll of Metamarkets pointed out the value of basic tools like grep to filter your data before processing it or moving it somewhere else. Tools like grep were designed to do one thing and do it well. Because they're so simple, they're also extremely flexible, and can easily be used to build up powerful processing pipelines using nothing but the command line. So while we have an extraordinary wealth of power tools at our disposal, we'll be the poorer if we forget the basics.

With that in mind, here's a very simple, and not contrived, task that I needed to accomplish. I'm a ham radio operator. I spent time recently in a contest that involved making contacts with lots of stations all over the world, but particularly in Russia. Russian stations all sent their two-letter oblast abbreviation (equivalent to a US state). I needed to figure out how many oblasts I contacted, along with counting oblasts on particular ham bands. Yes, I have software to do that; and no, it wasn't working (bad data file, since fixed). So let's look at how to do this with the simplest of tools.

(Note: Some of the spacing in the associated data was edited to fit on the page. If you copy and paste the data, a few commands that rely on counting spaces won't work.)

Log entries look like this:

QSO: 14000 CW 2011-03-19 1229 W1JQ       599 0001  UV5U       599 0041
QSO: 14000 CW 2011-03-19 1232 W1JQ       599 0002  SO2O       599 0043
QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0003  RG3K       599 VR  
QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0004  UD3D       599 MO  

Most of the fields are arcane stuff that we won't need for these exercises. The Russian entries have a two-letter oblast abbreviation at the end; rows that end with a number are contacts with stations outside of Russia. We'll also use the second field, which identifies a ham radio band (21000 kHz, 14000 kHz, 7000 kHz, 3500 kHz, etc.). So first, let's strip everything but the Russians with grep and a regular expression:

$ grep '599 [A-Z][A-Z]' rudx-log.txt | head -2
QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0003  RG3K       599 VR
QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0004  UD3D       599 MO

grep may be the most useful tool in the Unix toolchest. Here, I'm just searching for lines that have 599 (which occurs everywhere) followed by a space, followed by two uppercase letters. To deal with mixed case (not necessary here), use grep -i. You can use character classes like [[:upper:]] rather than specifying the range [A-Z], but why bother? Regular expressions can become very complex, but simple will often do the job, and be less error-prone.
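
For anyone who does want to bother, here is a small sketch comparing the two spellings. The temporary file is a hypothetical stand-in for rudx-log.txt that holds just the four sample entries shown earlier, and grep -c is used only to count matching lines:

```shell
# Hypothetical sample log: the four entries shown earlier, in a temp file.
log=$(mktemp)
cat > "$log" <<'EOF'
QSO: 14000 CW 2011-03-19 1229 W1JQ       599 0001  UV5U       599 0041
QSO: 14000 CW 2011-03-19 1232 W1JQ       599 0002  SO2O       599 0043
QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0003  RG3K       599 VR
QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0004  UD3D       599 MO
EOF

# Range form, as used in the article.
range_count=$(grep -c '599 [A-Z][A-Z]' "$log")

# POSIX character-class form; note the doubled brackets.
class_count=$(grep -c '599 [[:upper:]][[:upper:]]' "$log")

echo "range: $range_count  class: $class_count"
rm -f "$log"
```

Both patterns pick out the same two Russian contacts here. The class form mainly earns its keep in locales where a range like A-Z can behave unexpectedly.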

If you're familiar with grep, you may be asking why I didn't use $ to match the end of line, and forget about the 599 noise. Good question. There is some whitespace at the end of the line; we'd have to match that, too. Because this file was created on a Windows machine, instead of just a newline at the end of each line, it has a return and a newline. The $ that grep uses to match the end-of-line only matches a Unix newline. So I did the easiest thing that would work reliably.
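
To see the carriage-return problem in isolation, here is a minimal sketch. The two log lines are abbreviated, hypothetical copies of the entries above, written with Windows line endings and without the trailing spaces the real file has. Stripping the returns with tr is one common workaround, not something the original pipeline did:

```shell
# Hypothetical two-line log with Windows (CR LF) line endings.
crlf_log=$(mktemp)
printf 'QSO: 21000 CW 2011-03-19 1235 W1JQ 599 0003 RG3K 599 VR\r\n' >  "$crlf_log"
printf 'QSO: 14000 CW 2011-03-19 1229 W1JQ 599 0001 UV5U 599 0041\r\n' >> "$crlf_log"

# With the carriage return still in place, anchoring on $ finds nothing.
# (grep exits non-zero when nothing matches, hence the "|| true".)
anchored=$(grep -c '[A-Z][A-Z]$' "$crlf_log" || true)

# Deleting carriage returns first lets the anchored pattern match.
stripped=$(tr -d '\r' < "$crlf_log" | grep -c '[A-Z][A-Z]$')

echo "anchored: $anchored  after tr: $stripped"
rm -f "$crlf_log"
```

On a typical Unix grep the first count is 0 and the second is 1. The real file would also need its trailing spaces handled, which is why matching on the 599 was the easier route.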

The simple head utility is a jewel. If you leave head off of the previous command, you'll get a long listing scrolling down your screen. That's rarely useful, especially when you're building a chain of commands. head gives you the first few lines of output: 10 lines by default, but you can specify the number of lines you want. -2 says "just two lines," which is enough for us to see that this script is doing what we want.
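
One portability note: the bare -2 is a historical shorthand, and the form POSIX specifies is -n followed by the count. A tiny sketch, using made-up input rather than the log:

```shell
# head -n N is the POSIX spelling of the older head -N shorthand.
first_two=$(printf 'one\ntwo\nthree\nfour\n' | head -n 2)
echo "$first_two"
```

This prints the first two lines, one and two, and discards the rest.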

Next, we need to cut out the junk we don't want. The easy way to do this is to use colrm (remove columns). That takes two arguments: the first and last column to remove. Column numbering starts with one, so in this case we can use colrm 1 72.

$ grep '599 [A-Z][A-Z]' rudx-log.txt  | colrm 1 72 | head -2

How did I know we wanted column 72? Just a little experimentation; command lines are cheap, especially with command history editing. I should actually use 73, but that additional space won't hurt, nor will the additional whitespace at the end of each line. Yes, there are better ways to select columns; we'll see them shortly. Next, we need to sort and find the unique abbreviations. I'm going to use two commands here: sort (which does what you'd expect), and uniq (to remove duplicates).
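
If experimentation feels too loose, the starting column can also be computed. This is only a sketch of one way to do it, not what I did at the time: it assumes the field we want is the last one on the line and that there is no trailing whitespace, and it uses cut -c to keep the tail of the line instead of using colrm to delete the head.

```shell
# One sample entry, without trailing whitespace.
line='QSO: 21000 CW 2011-03-19 1235 W1JQ       599 0003  RG3K       599 VR'

# 1-based column where the last whitespace-separated field starts.
col=$(printf '%s\n' "$line" | awk '{ print length($0) - length($NF) + 1 }')

# Keep everything from that column on; same effect as: colrm 1 $((col - 1))
tail_field=$(printf '%s\n' "$line" | cut -c "$col"-)

echo "column $col: $tail_field"
```

The computed column depends on the exact spacing of the line, which is why it will differ between this sample and the real log.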

$ grep '599 [A-Z][A-Z]' rudx-log.txt  | colrm 1 72 | sort |\
   uniq | head -2

Sort has a -u option that suppresses duplicates, but for some reason I prefer to keep sort and uniq separate. sort can also be made case-insensitive (-f), can select particular fields (meaning we could eliminate the colrm command, too), can do numeric sorts in addition to lexical sorts, and lots of other things. Personally, I prefer building up long Unix pipes one command at a time to hunting for the right options.
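A quick way to convince yourself that the two forms agree is to try them on a short list. The abbreviations below are just sample input, and the function name is made up for this sketch.

```shell
# Hypothetical list of oblast abbreviations, with repeats.
abbrevs() {
  printf '%s\n' VR AD MA AD VR SP
}

# Two-step version: sort, then drop adjacent duplicates.
# uniq only compares neighboring lines, which is why sort has to come first.
abbrevs | sort | uniq

# One-step version: sort -u produces the same list.
abbrevs | sort -u
```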

Finally, I said I wanted to count the number of oblasts. One of the most useful Unix utilities is a little program called wc: "word count." That's what it does. Its output is three numbers: the number of lines, the number of words, and the number of characters it has seen. For many small data projects, that's really all you need.

$ grep '599 [A-Z][A-Z]' rudx-log.txt  | colrm 1 72 | sort | uniq | wc
      38      38     342

So, 38 unique oblasts. You can say wc -l if you only want to count the lines; sometimes that's useful. Notice that we no longer need to end the pipeline with head; we want wc to see all the data.

But I said I also wanted to know the number of oblasts on each ham band. That's the first number (like 21000) in each log entry. So we're throwing out too much data. We could fix that by adjusting colrm, but I promised a better way to pull out individual columns of data. We'll use awk in a very simple way:

$ grep '599 [A-Z][A-Z]' rudx-log.txt  | awk '{print $2 " " $11}' |\
     sort | uniq 
14000 AD
14000 AL
14000 AN

awk is a very powerful tool; it's a complete programming language that can do almost any kind of text manipulation. We could do everything we've seen so far as an awk program. But rather than use it as a power tool, I'm just using it to pull out the second and eleventh fields from my input. The single quotes are needed around the awk program, to prevent the Unix shell from getting confused. Within awk's print command, we need to explicitly include the space, otherwise it will run the fields together.
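The difference between the ways of printing two fields is easiest to see on a single line. This sketch uses one made-up log entry in the same layout as the contest log; the sample_line function is hypothetical.

```shell
# One hypothetical log line (whitespace-separated fields).
sample_line() {
  printf 'QSO: 14000 CW 2008-03-15 1526 W1JQ      599 0054 UA6YW         599 AD    \n'
}

# No separator: the two fields run together.
sample_line | awk '{print $2 $11}'

# Explicit space, as in the pipeline above.
sample_line | awk '{print $2 " " $11}'

# A comma inserts awk's output field separator, a single space by default.
sample_line | awk '{print $2, $11}'
```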

The cut utility is another alternative to colrm and awk. It's designed for removing portions of a file. cut isn't a full programming language, but it can make more complex transformations than simply deleting a range of columns. However, although it's a simple tool at heart, it can get tricky; I usually find that, when colrm runs out of steam, it's best jumping all the way to awk.
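Here is a sketch of where cut gets tricky on this kind of data. cut treats every single space as a field delimiter, so padded columns throw the field numbers off unless the runs of spaces are squeezed first. The log line and function names are made up for illustration.

```shell
# One hypothetical log line, padded with runs of spaces.
sample_line() {
  printf 'QSO: 14000 CW 2008-03-15 1526 W1JQ      599 0054 UA6YW         599 AD    \n'
}

# Naive attempt: each space in a run starts a new (empty) field,
# so field 11 is not the abbreviation we wanted.
sample_line | cut -d ' ' -f 2,11

# Squeeze repeated spaces to one with tr -s, and the numbering lines up with awk's.
fields_with_cut() {
  tr -s ' ' | cut -d ' ' -f 2,11
}
sample_line | fields_with_cut
```

awk splits on runs of whitespace by default, which is why it needs no such preprocessing.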

We're still a little short of our goal: how do we count the number of oblasts on each band? At this point, I use a really cheesy solution: another grep, followed by wc:

$ grep '599 [A-Z][A-Z]' rudx-log.txt  | awk '{print $2 " " $11}' |\
     sort | uniq | grep 21000 | wc
      20      40     180
$ grep '599 [A-Z][A-Z]' rudx-log.txt  | awk '{print $2 " " $11}' |\
     sort | uniq | grep 14000 | wc
      26      52     234

OK, 20 oblasts on the 21 MHz band, 26 on the 14 MHz band. And at this point, there are two questions you really should be asking. First, why not put grep 21000 first, and save the awk invocation? That's just how the script developed. You could put the grep first, though you'd still need to strip extra gunk from the file. Second: What if there are gigabytes of data? You have to run this command for each band, and for some other project, you might need to run it dozens or hundreds of times. That's a valid objection. To solve this problem, you need a more complex awk script (which has associative arrays in which you can save data), or you need a programming language such as perl, python, or ruby. At the same time, we've gotten fairly far with our data exploration, using only the simplest of tools.
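For the curious, here is roughly what the "more complex awk script" might look like: a single pass that counts distinct oblasts for every band at once. It is a sketch under assumptions, not the script used for this article. The sample log is invented, the function names are made up, and the test on fields 10 and 11 is my approximation of the grep pattern used above.

```shell
# Hypothetical sample log; the last entry carries a serial number, not an oblast.
sample_log() {
  cat <<'EOF'
QSO: 14000 CW 2008-03-15 1526 W1JQ      599 0054 UA6YW         599 AD
QSO: 14000 CW 2008-03-15 1530 W1JQ      599 0055 RG3K          599 VR
QSO: 14000 CW 2008-03-15 1541 W1JQ      599 0056 UA6AA         599 AD
QSO: 21000 CW 2008-03-15 1601 W1JQ      599 0057 RA3XX         599 MA
QSO: 21000 CW 2008-03-15 1610 W1JQ      599 0058 W2ABC         599 0101
EOF
}

# Count distinct oblasts per band in one pass over the input.
# seen[] remembers each band/oblast pair; count[] tallies new pairs per band.
oblasts_per_band() {
  awk '$10 == "599" && $11 ~ /^[A-Z][A-Z]$/ {
         key = $2 " " $11
         if (!(key in seen)) { seen[key] = 1; count[$2]++ }
       }
       END { for (band in count) print band, count[band] }' | sort
}

sample_log | oblasts_per_band
```

The trailing sort is only there because awk's for-in loop visits array keys in no particular order.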

Now let's up the ante. Let's say that there are a number of directories with lots of files in them, including these rudx-log.txt files. Let's say that these directories are organized by year (2001, 2002, etc.). And let's say we want to count oblasts across all the years for which we have records. How do we do that?

Here's where we need find. My first approach is to take the filename (rudx-log.txt) out of the grep command, and replace it with a find command that looks for every file named rudx-log.txt in subdirectories of the current directory:

$ grep '599 [A-Z][A-Z]' `find . -name rudx-log.txt -print`  |\
   awk '{print $2 " " $11}' | sort | uniq | grep 14000 | wc
      48      96     432

OK, so 48 oblasts on the 14 MHz band, lifetime. I thought I had done better than that. What's happening, though? That find command is simply saying "look at the current directory and its subdirectories, find files with the given name, and print the output." The backquotes tell the Unix shell to use the output of find as arguments to grep. So we're just giving grep a long list of files, instead of just one. Note the -print option: on older versions of find, if it's not there, find happily does nothing; current versions print by default.

We're almost done, but there are a couple of bits of hair you should worry about. First, if you invoke grep with more than one file on the command line, each line of output begins with the name of the file in which it found a match:

./2008/rudx-log.txt:QSO: 14000 CW 2008-03-15 1526 W1JQ      599 0054 \\
UA6YW         599 AD    
./2009/rudx-log.txt:QSO: 14000 CW 2009-03-21 1225 W1JQ      599 0015 \\
RG3K          599 VR    

We're lucky. grep just sticks the filename at the beginning of the line without adding spaces, and we're using awk to print selected whitespace-separated fields. So the number of any field didn't change. If we were using colrm, we'd have to fiddle with things to find the right columns. If the filenames had different lengths (reasonably likely, though not possible here), we couldn't use colrm at all. Fortunately, you can suppress the filename by using grep -h.
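To see the filename prefix appear and disappear, you can build a throwaway archive and run grep both ways. Everything here is illustrative: the directory comes from mktemp, the make_demo_logs helper is hypothetical, and a shell glob stands in for the find command to keep the sketch short.

```shell
# Build a tiny hypothetical archive: one single-line log per year.
make_demo_logs() {
  dir=$(mktemp -d)
  mkdir -p "$dir/2008" "$dir/2009"
  printf 'QSO: 14000 CW 2008-03-15 1526 W1JQ      599 0054 UA6YW         599 AD\n' \
    > "$dir/2008/rudx-log.txt"
  printf 'QSO: 14000 CW 2009-03-21 1225 W1JQ      599 0015 RG3K          599 VR\n' \
    > "$dir/2009/rudx-log.txt"
  printf '%s\n' "$dir"
}

demo=$(make_demo_logs)

# Several files on the command line: each match is prefixed with its filename.
grep '599 [A-Z][A-Z]' "$demo"/*/rudx-log.txt

# With -h the prefix is gone, so column positions match the single-file case.
grep -h '599 [A-Z][A-Z]' "$demo"/*/rudx-log.txt

rm -r "$demo"
```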

The second piece of hair is less common, but potentially more troublesome. If you look at the last command, what we're doing is giving the grep command a really long list of filenames. How long is long? Can that list get too long? The answers are "we don't know," and "maybe." In the nasty old days, things broke when the command line got longer than a few thousand characters. These days, who knows what's too long ... But we're doing "big data," so it's easy to imagine the find command expanding to hundreds of thousands, even millions of characters. More than that, our single Unix pipeline doesn't parallelize very well; and if we really have big data, we want to parallelize it.

The answer to this problem is another old Unix utility, xargs. Xargs dates back to the time when it was fairly easy to come up with file lists that were too long. Its job is to break up command line arguments into groups and spawn as many separate commands as needed, running in parallel if possible (-P). We'd use it like this:

$ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]'  |\
  awk '{print $2 " " $11}' | grep 14000 | sort | uniq | wc
      48      96     432

This command is actually a nice little map-reduce implementation: the xargs command maps grep across all the cores on your machine (if you ask for parallel execution with -P), and the output is reduced (combined) by the awk/sort/uniq chain. xargs has lots of command line options, so if you want to be confused, read the man page.
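Here is a sketch of the parallel variant, run against a made-up archive. The options -print0, -0 and -P are not in POSIX, but the GNU and BSD versions of find and xargs both have them; the choice of four processes and the helper function names are arbitrary assumptions for the example.

```shell
# Build a small hypothetical archive: one log per year (made-up contacts).
make_demo_logs() {
  dir=$(mktemp -d)
  mkdir -p "$dir/2008" "$dir/2009"
  cat > "$dir/2008/rudx-log.txt" <<'EOF'
QSO: 14000 CW 2008-03-15 1526 W1JQ      599 0054 UA6YW         599 AD
QSO: 21000 CW 2008-03-15 1601 W1JQ      599 0055 RA3XX         599 MA
EOF
  cat > "$dir/2009/rudx-log.txt" <<'EOF'
QSO: 14000 CW 2009-03-21 1225 W1JQ      599 0015 RG3K          599 VR
QSO: 14000 CW 2009-03-21 1240 W1JQ      599 0016 UA6AA         599 AD
EOF
  printf '%s\n' "$dir"
}

# -print0 and -0 pass NUL-separated names, so odd characters in paths are safe.
# -n 1 hands each grep one file; -P 4 allows up to four greps to run at once.
# -h keeps filenames out of the output so the awk field numbers stay put.
count_14mhz_oblasts() {
  find "$1" -name rudx-log.txt -print0 |
    xargs -0 -n 1 -P 4 grep -h '599 [A-Z][A-Z]' |
    awk '{print $2 " " $11}' | grep 14000 | sort | uniq | wc -l
}

demo=$(make_demo_logs)
count_14mhz_oblasts "$demo"
rm -r "$demo"
```

The order in which the parallel greps deliver their output is not predictable, which doesn't matter here because everything is sorted afterwards.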

Another approach is to use find's -exec option to invoke arbitrary commands. It's somewhat more flexible than xargs, though in my opinion, find -exec has the sort of overly flexible but confusing syntax that's surprisingly likely to lead to disaster. (It's worth noting that the examples for -exec almost always involve automating bulk file deletion. Excuse me, but that's a recipe for heartache. Take this from the guy who once deleted the business plan, then found that the backups hadn't been done for about 6 months.) There's an excellent tutorial for both xargs and find -exec at Softpanorama. I particularly like this tutorial because it emphasizes testing to make sure that your command won't run amok and do bad things (like deleting the business plan).
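In the spirit of testing first, here is a minimal sketch of the -exec form on a throwaway archive: a dry run with -print to see what find selects, then the same search with grep attached. The sample data and helper names are made up, and nothing here deletes anything except the temporary directory the script itself created.

```shell
# Build a tiny hypothetical archive: one single-line log per year.
make_demo_logs() {
  dir=$(mktemp -d)
  mkdir -p "$dir/2008" "$dir/2009"
  printf 'QSO: 14000 CW 2008-03-15 1526 W1JQ      599 0054 UA6YW         599 AD\n' \
    > "$dir/2008/rudx-log.txt"
  printf 'QSO: 14000 CW 2009-03-21 1225 W1JQ      599 0015 RG3K          599 VR\n' \
    > "$dir/2009/rudx-log.txt"
  printf '%s\n' "$dir"
}

# "{} +" batches many filenames into each grep invocation, much as xargs does;
# "{} \;" would instead run one grep per file.
band_oblasts() {
  find "$1" -name rudx-log.txt -exec grep -h '599 [A-Z][A-Z]' {} + |
    awk '{print $2 " " $11}' | sort | uniq
}

demo=$(make_demo_logs)

# Dry run: list what find selects before attaching any command to it.
find "$demo" -name rudx-log.txt -print

# Same search, now handing the matched files to grep.
band_oblasts "$demo"

rm -r "$demo"
```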

That's not all. Back in the dark ages, I wrote a shell script that did a recursive grep through all the subdirectories of the current directory. That's a good shell programming exercise which I'll leave to the reader. More to the point, I've noticed that there's now a -R option to grep that makes it recursive. Clever little buggers ...

Before closing, I'd like to touch on a couple of tools that are a bit more exotic, but which should be in your arsenal in case things go wrong. od -c gives a raw dump of every character in your file. (-c says to dump characters, rather than octal or hexadecimal). It's useful if you think your data is corrupted (it happens), or if it has something in it that you didn't expect (it happens a LOT). od will show you what's happening; once you know what the problem is, you can fix it. To fix it, you may want to use sed. sed is a cranky old thing: more than a hand tool, but not quite a power tool; sort of an antique treadle-operated drill press. It's great for editing files on the fly, and doing batch edits. For example, you might use it if NUL characters were scattered through the data.
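As a small illustration of the two tools working together, this sketch uses od -c to expose the Windows carriage returns mentioned earlier, then sed to strip them. I've picked carriage returns rather than NULs because they can be handled portably; the sample data and function names are made up.

```shell
# A hypothetical two-line file with Windows (CR LF) line endings.
dos_sample() {
  printf 'AD\r\nVR\r\n'
}

# od -c makes the invisible carriage returns visible as \r.
dos_sample | od -c

# Strip a trailing carriage return from every line with sed.
# The CR is captured in a shell variable because writing \r inside a
# sed pattern is not portable across sed implementations.
strip_cr() {
  cr=$(printf '\r')
  sed "s/${cr}\$//"
}

# After the edit, od shows only the letters and the newlines.
dos_sample | strip_cr | od -c
```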

Finally, a tool I just learned about (thanks, @dataspora): the pipe viewer, pv. It isn't a standard Unix utility. It comes with some versions of Linux, but the chances are that you'll have to install it yourself. If you're a Mac user, it's in MacPorts. pv tells you what's happening inside the pipes as the command progresses. Just insert it into a pipe like this:

$ find . -name rudx-log.txt -print | xargs grep '599 [A-Z][A-Z]'  |\
  awk '{print $2 " " $11}' | pv | grep 14000 | sort | uniq | wc
3.41kB 0:00:00 [  20kB/s] [<=>  
      48      96     432

The pipeline runs normally, but you'll get some additional output that shows the command's progress. If something's malfunctioning or performing too slowly, you'll find out. pv is particularly good when you have huge amounts of data, and you can't tell whether something has ground to a halt, or you just need to go out for coffee while the command runs to completion.

Whenever you need to work with data, don't overlook the Unix "hand tools." Sure, everything I've done here could be done with Excel or some other fancy tool like R or Mathematica. Those tools are all great, but if your data is living in the cloud, using them is possible, but painful. Yes, we have remote desktops, but remote desktops across the Internet, even with modern high-speed networking, are far from comfortable. Your problem may be too large to use the hand tools for final analysis, but they're great for initial explorations. Once you get used to working on the Unix command line, you'll find that it's often faster than the alternatives. And the more you use these tools, the more fluent you'll become.

Oh yeah, that broken data file that would have made this exercise superfluous? Someone emailed it to me after I wrote these scripts. The scripting took less than 10 minutes, start to finish. And, frankly, it was more fun.

January 12 2011

Developer Week in Review

Now firmly seated in the New Year, your week in review returns to its normally scheduled programming.

No sale for Novell?

As reported in the Year in Review, Novell had plans to sell a chunk of Unix intellectual property to CPTN Holdings, a consortium that includes Microsoft, Apple, EMC and Oracle. This reopened the fear that Linux would come under patent attack. Last week, it was reported that the deal was evidently off, but according to Microsoft, it was just a procedural thing with German regulators, and the process is moving ahead according to plan.

Assuming this sale goes through, it will remain to be seen if the Gang of Four takes the next step and tries to prosecute any of the patents against the open source community. It's possible that they intend to use them against other companies, or as protection against IP actions. But given Microsoft's history in the SCO controversy and the company's feelings about Linux, it is also possible that pigs will fly.

The worst kept secret in the Industry

If you haven't heard that Apple finally inked a deal with Verizon this week, you should consider subletting the rock you've been hiding under. The interesting question that no one seems to be asking is if this is going to start the fractionalization of the iOS developer community. The Verizon version of the iPhone will ship with a mobile hotspot feature that the AT&T version lacks, and you can't help but wonder if other differences will creep into the iPhone over time as different carriers put different restrictions and requirements on the platform. One of the major selling points of the iPhone is that there has been little platform diversity for developers to deal with, apart from some sensors and the iPad. If too much branching of the hardware and software platform occurs, Apple could find themselves in the same boat with Android.

We also know that certain apps were banned from the App Store because AT&T objected to them. Will apps now have to pass muster for two different carriers, or will we start to see AT&T and Verizon-only applications?

Tablets, tablets, tablets!

That yearly pilgrimage of tech-heads, CES, has ended, and the big news for software developers is that tablets appear to be the new black. Multiple vendors showed off iPad wannabes at CES, many based on Android, a few on Linux, and a few running Windows.

Smartphones have already changed how software is developed, as applications have moved away from the keyboard-and-mouse input model. But until now, desktop-level applications have still clung to the old way. As tablets start to replace notebooks and netbooks, we're likely to see development shifts in productivity and enterprise applications that traditionally were tethered to a keyboard.

What does the future hold for those who code? My crystal ball is currently installing update 2 of 543, so I guess you'll have to check back here next week to find out. Suggestions are always welcome, so please send tips or news here.


April 07 2010
