Social Web or Tempo Web?

Back in 2004, Tim O’Reilly observed that the Web had changed, and coined the term Web 2.0. This new Web is made of several layers which enable the Social Web. Wikipedia and Facebook are defining examples of the Social Web.

This sudden discovery of the Social Web feels wrong to me. In the early nineties, I was an active user of Bulletin Board Systems (BBSes). While they were not the Web, or even part of the Internet, BBSes were clearly a social medium. You know the multi-user games people play on Facebook? We had those back in 1990. The graphics were poorer, obviously, but it was all about meeting people.

The barrier to entry keeps getting lower, to the point where even grandfathers are now on Facebook. But the Web was hardly ever limited to an elite. Even BBSes were quite democratic: retired teachers would chat with young hackers all the time. It is the extremely low cost and ubiquity of computers that make the Social Web so widespread.

A much more interesting change has received less notice: the tempo of the Web is changing. GeoCities made it easy for anyone to create a home page. But updating your home page was a slow process. In effect, our mental model of the Web was that of a library, and Web sites were books that could be updated from time to time. Eventually, we gave up on this model and decided to view the Web as a data stream. This realization changed everything.

The pace used to range from static web pages to flame wars on posting boards. We have now expanded our temporal range: we can communicate with high frequency, in short bursts. Twitter is one extreme: it is akin to techno music. Facebook is somewhat slower and more elaborate, maybe like rock. Publishing research articles is no music at all: it is akin to the rhythm of the Earth around the Sun. These tools differ not just in the frequency of their updates, but also in their volume and in the length of the pauses.

Maybe we should try to understand the Web by analogy with music. How does the Web sound to you, today?

Further reading: See my blog post Is Map-Reduce obsolete? Also, be sure to check Venkatesh Rao’s blog. Rao has a new book which I will review in the future.

Not even eventually consistent

Many database engines ensure consistency: at any given time, the database state is logically consistent. For example, even if you receive purchase requests by the thousands, you always have an accurate count of how many products you have sold and how many remain in stock. Accountants are especially concerned with consistency: they invented techniques such as double-entry bookkeeping to preserve it. Inconsistencies cause problems:

  • The user bought a product, a purchase record has been added, but the user account has not yet been charged. This could allow a user to buy more than he can afford.
  • All items in stock have been sold, but a customer is being told that a few items remain in stock. Thus, a vendor could make sales it cannot fulfill.

Yet, in practice, requiring consistency means that your system will become unavailable from time to time. Thus, many NoSQL databases have adopted an eventual consistency approach. That is, while at any point in time the system might be logically inconsistent, it will eventually recover:

  • The user account will be charged prior to the delivery of the product.
  • A customer who ordered a product that is no longer in stock will be told prior to finishing his order.
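
To make this concrete, here is a minimal sketch of an eventually consistent counter. It is my own illustration, not the protocol of any particular NoSQL engine: each replica accepts updates locally, and any two replicas converge as soon as they exchange state.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// A grow-only counter: each replica increments its own slot, and
// replicas converge by taking the element-wise maximum of the slots.
struct GrowOnlyCounter {
    std::vector<uint64_t> slots; // one slot per replica
    explicit GrowOnlyCounter(size_t replicas) : slots(replicas, 0) {}
    void increment(size_t replica) { slots[replica] += 1; }
    uint64_t value() const {
        uint64_t total = 0;
        for (uint64_t s : slots) total += s;
        return total;
    }
    // The merge is commutative, associative and idempotent, so the
    // replicas agree no matter how often or in what order they sync.
    void merge(const GrowOnlyCounter& other) {
        for (size_t i = 0; i < slots.size(); ++i)
            slots[i] = std::max(slots[i], other.slots[i]);
    }
};

int main() {
    GrowOnlyCounter a(2), b(2); // two replicas of a sales counter
    a.increment(0);             // one sale recorded on replica 0
    b.increment(1);             // one sale recorded on replica 1
    // The replicas disagree until they exchange state...
    a.merge(b);
    b.merge(a);
    // ...and then both report the same total: 2 sales.
    std::cout << a.value() << " " << b.value() << std::endl;
}

Between merges, a replica may report a stale count, which is exactly the sort of temporary inconsistency described above.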

Even though it gives headaches to developers, eventual consistency is good enough. In any case, robust systems have to deal with exceptions. Even if your system tells you that there are enough items in stock, it could be that one item was damaged in the warehouse, or that another vendor is willing to quickly provide the missing items. That is, at best, the data in your database is a logically consistent abstraction. But all non-trivial abstractions are, to some degree, leaky. And that is why we pay accountants: they chase down the leaks.

Maybe we should keep in mind that the largest, most powerful information system ever designed is logically inconsistent: the Web is full of dangling and misdirected hyperlinks. It would be extremely hard to debug the Web. Thankfully, we do not need to. My own Web site is probably filled with mistakes. I am sure that I link to pages that no longer exist. Still, it works. The cost of maintaining full integrity would be too high. The errors are acceptable.

As an analogy, security experts rarely try to fully secure systems. Instead, they identify components that must be secured, and they determine the risk tolerance. Complete security is simply too expensive in practice, unless you are willing to live in an isolated bunker.

Thus, maybe the solution is to accept information systems that are not even eventually consistent. If I run a business, all that matters is that my customers are happy and that I am making a healthy profit. Everything else is secondary. I might be willing to tolerate that some customers are never charged for an item they receive. I might be willing to slightly overstock on some items.

Sure, some might describe a not-even-eventually-consistent system as an abomination, something impossible to program for… but the Web would have been described the same way in the 1980s. The key is that the Web is not inconsistent in any random way: it has its own viable logic.

Further reading: The CALM Conjecture: Reasoning about Consistency

Book review: Statistical Analysis with R

The programming language R is a standard for statisticians. And it is free software which runs on Windows, Mac and Linux.

You can learn much online about R, but if you prefer a bona fide book, there are also many to choose from. I just finished one of those: Statistical Analysis with R by John M. Quick.

The book is colorful and playful, which is a good idea for a “Beginner’s Guide”. The layout is attractive, and there are many detailed examples. I like the pedagogy: the author wants you to learn by doing. However, the book is poor as a reference: the coverage is limited despite the 300 pages, and the chapter summaries are strictly non-technical.

Overall, this is a good book for people with no programming background who just want to use R to load data, run ANOVA tests and fit simple models. They will find step-by-step instructions, down to the installation of the software, complete with screenshots of every step. This could be a good book for scientists or business people who want to try R as a substitute for Excel.

Disclosure: I got a free copy of the e-book from the publisher with the expectation that I would publish a review.

Innovating without permission

Is Open Source software better than closed-source software? Is Wikipedia better than Britannica? Is NoSQL better than Oracle and SQL Server? Are blogs better than corporate news services? Do patents favor innovation? Do long and painful funding applications help science? Is learning off Wikipedia and YouTube as good as learning in a classroom? Is duck typing good or bad?

All these questions are closely related to the concept of Innovation without permission: people innovate more when they do not have to ask permission.

Innovation without permission is analogous to Ridiculously Easy Group-Formation, popularized by Clay Shirky, but first proposed by my colleague Sébastien Paquet in 2002. Paquet’s message is that the threshold for group creation has been pushed to an unprecedented low. In 2011, it is hard to disagree with him: entire political movements now come together online almost spontaneously.

Similarly, we are lowering the barriers to innovation to unprecedented lows:

  • Open Source developers are probably no better than closed-source developers, but they typically produce their work without asking permission. Moreover, Open Source has gotten much easier over the years. Starting a project was relatively difficult in the nineties. Then SourceForge came along. Today, we have GitHub, which is even easier. Using GitHub, people I had never heard of, and never approved, submitted patches to some of my code without permission! Fantastic!
  • When the era of personal computers arrived, a small team could publish a game, get it distributed through stores and sell it to millions of people. Yet, last year, Markus Persson became a millionaire in a month by publishing an unfinished game (Minecraft) as a Java applet on his web site. At each step in this process, the barrier to innovation becomes lower. First, you had to ask the stores for permission. Today, you can just post your work and, apparently, get paid very well for your effort.
  • Many of the best-selling titles in the Amazon Kindle store come from self-published authors. People write and sell books without asking a publisher for permission! They are also making more money than they used to.
  • Wikipedia contributors and editors are no better than Britannica’s authors, but they mostly work without asking permission.
  • NoSQL databases often allow developers to add new features without having to convince a DBA to change the database schema.
  • Bloggers post without the approval of an editor.
  • Duck Typing allows you to use a function that was meant for a different data type without having to ask for a change in the API.
  • The barriers in scientific research have come down significantly. You can solve problems, post your solutions online, and people might offer you a million dollars. It happened to Perelman. Increasingly, access to the literature is Open Access. Anyone can read the papers on PLoS and even contribute comments.

We can identify some barriers to innovation which are also great opportunities:

  • Patents and copyright are first on my list. Much of the copyrighted content is easier to pirate than to license. Unsurprisingly, people do not ask permission: they violate copyright. A possible outcome is that all industries could become like the fashion industry, where copyright is largely irrelevant.
  • Funding remains a significant barrier to innovation. The Kickstarter model is interesting in this respect.
  • While I love web applications such as Google Mail, YouTube and Facebook, they won’t let their users redesign the product. You can post innovative content, but you cannot contribute (much) to the architecture of the application. Twitter is maybe the exception: users redefined it through tags and retweets, two design elements that were not present in the system initially. I am amazed that nobody has thought of creating a web application letting people program and distribute video games online. (If you implement this idea well and become a millionaire, please send me a check.)

Demarchy and probabilistic algorithms


Demarchy is a political system built on randomness. It was used to select political leaders in Ancient Greece and in France (during the French Revolution). In many countries, jury selection is random. On Slashdot, the comment moderation system relies on randomly selected members: as you use the system, you increase your (digital) karma, which increases the probability that you will become a moderator. Demarchy has been depicted by several science-fiction authors including Arthur C. Clarke, Kim Stanley Robinson, Ken MacLeod and Alastair Reynolds.

Democracy is deterministic: we send the representative with the most votes. Demarchy is probabilistic: we need a source of random numbers to operate it. Unfortunately, probabilities are less intuitive than simple statistical measures like the most frequent item. It is also much more difficult to implement demarchy, at least without a computer. Meanwhile, most software relies on probabilistic algorithms:

  • Databases use hash tables to recover values in expected constant time;
  • Cryptography is almost entirely about probabilities;
  • Most of Machine Learning is based on probabilistic models.

In fact, probabilistic algorithms often work better than their deterministic counterparts.
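
A Slashdot-like selection process is only a few lines of code. Here is a sketch (the names and karma values are made up) that picks moderators at random, weighted by karma:

#include <iostream>
#include <random>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> users = {"alice", "bob", "carol", "dave"};
    std::vector<double> karma = {50, 10, 30, 10}; // relative weights
    std::mt19937 gen((std::random_device())());
    // discrete_distribution returns index i with probability
    // karma[i] divided by the sum of all karma values.
    std::discrete_distribution<size_t> pick(karma.begin(), karma.end());
    for (int round = 0; round < 5; ++round)
        std::cout << users[pick(gen)] << " moderates today" << std::endl;
}

Active users (high karma) are picked more often, but nobody is picked with certainty: that is demarchy in miniature.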

Some common objections to demarchy:

  • Not everyone is equally qualified to run a country or a city. My counter-argument: Demarchy does not imply that everyone has an equal chance of being among the leaders. Just as in jury selection, some people are disqualified. Moreover, just as on Slashdot, there is no reason to believe that everyone would get an equal chance. Active participants in political life might receive higher “digital karma”. Of course, from time to time, incompetent leaders would be selected by random chance. But we also get incompetent leaders under our democratic system.
  • There is too much to learn: the average Joe would be unable to cope. My counter-argument: Demarchy does not imply that a dictator is nominated for life, or that the leaders can make up all the rules. In the justice system, jury duty covers a short period of time and a narrow scope (one case), is closely supervised by an arbitrator (the judge), and typically involves several nominees (a dozen in Canada). And if qualifications are what matters, then why don’t we elect leaders based on their objective results on some tests? The truth is that many great leaders did not do well on tests. Most notably, Churchill was terrible at mathematics. Moreover, experts such as professional politicians are often overrated: it has been claimed that picking stocks at random is better than hiring an expert.
  • Demarchy nominees would be less moral than democratically elected leaders. My counter-argument: I think the opposite would happen. Democratic leaders often feel entitled to their special status. After all, they worked hard for it. Studies show that white-collar criminals are not less moral than the rest of us; they just think that the rules don’t apply to them because they are special. Politics often attracts highly ambitious (and highly entitled) people. Meanwhile, demarchy leaders would be under no illusion as to their status. Most people would see the nomination as a duty to fulfill. And it is much easier for lobbyists to target professional politicians than short-lived demarchy nominees.

Further reading: Brian Martin’s writings on demarchy and democracy and How to choose? by Michael Schulson.

Our institutions are limited by the pre-digital technology

Many of our institutions are limited by pre-digital technology: (1) it is difficult to constantly re-edit a paper book; (2) without computers, global trade requires perennial currencies; (3) without information technology, any political system more fluid than representative democracy cannot scale up to millions of individuals. These embedded limitations still constrain us, for now.

  • We teach kids arithmetic and calculus, but systematically fail to teach them about probabilities. We train them to distinguish truth from falsehood, when most things are neither true nor false. Meanwhile, sites like Stack Overflow and Math Overflow expose these same kids to rigorous debates, much like those I imagine the Ancient Greeks having. We are exposing the inner workings of scholarship like never before. Instead of receiving knowledge from carefully crafted books and lectures, kids have to be trained to seek out truth on their own. The new generation will expect textbooks to be editable, like Wikipedia. You might be 16, but you can still correct a professor. And the 12-year-old next door can correct you too. Are we still going to have static books, idealized in some “final version”? I think that books and journal articles will have to become dynamic, or they will grow irrelevant.
  • Why do we need central banks and currencies? Because they alone can ensure determinism and stability? Why can’t we have several competing currencies? In any case, how wealthy is Mark Zuckerberg? Financial determinism is an illusion. There may be a picture of the Queen on my dollar bills, but the authority of the Monarch does not protect the value of this currency. If not for their fear of upsetting local governments, VISA and Mastercard could create a new currency in an instant. If not for government regulations, Paypal would be competing with central banks. Suppose that I help a game company. Maybe they could pay me in their in-game currency. Many people worldwide would be willing to trade this currency against other goods and services. At no point do I need to print money! We could even automate the process. If I get some in-game currency and I want to buy food, can a computer arrange trades so that I get bags of rice delivered to my home? Of course it can. Obviously, governments would start regulating, because these sorts of trades effectively cut the government out. But I think it would be very difficult to outlaw or regulate these trades in a context where the financial markets are driven by algorithms.
  • We expect political leaders to represent the interests of an entire population. Have you ever stopped to consider how insane this expectation is? Would you trust a jury made of a single person? We have learned from machine learning that the best results are typically achieved with ensemble methods. What we need are sets of leaders, automatically selected and weighted. We don’t want the best leader, we want the best ensemble of leaders. And the best way to avoid biases is to change these leaders frequently. We almost have the technology to do it in a scalable manner. Electronic voting is only the first step toward a new, more fluid form of politics. If it sounds crazy, consider that people spend more and more time online, where politics is very different than in our brick-and-mortar world. The models emerging online will be implemented in offline politics. That is where the political innovation will occur.
          | Currency                   | Mathematics             | Politics  | Knowledge
Primitive | None                       | One, Two or Many        | Tribe     | Stories
Moderate  | Local trade                | Arithmetic and geometry | Monarchy  | Manuscripts
Advanced  | Central currencies         | Calculus                | Democracy | Books, Authoritative Science
Upcoming  | Distributed and electronic | Bayesian                | Demarchy  | Open and Participative Scholarship

Make your own programmable digital thermometer in an hour

I make my own yogourt because I cannot stand commercial yogourt. You can make your own in less than 30 minutes: heat milk to 112F (44C), mix in a small quantity of yogourt, and leave the container of warm milk wrapped in a blanket overnight.

To minimize the labor I wanted a digital thermometer with an alarm set at 112F. Alas, most kitchen digital thermometers are designed for cooking meat. They can take intense heat, but exposure to milk shortens their life considerably.

Inspired by Frauenfelder’s Made by Hand, I decided to make my own programmable digital thermometer.

You can do it in an hour and for less than $80. You do not need any specific knowledge. A kid could do it.

You need to order:

  • An Arduino board ($30). I have used an Uno Arduino.
  • An LCD Keypad Shield ($24). I have used the DFRobot Shield.
  • A ZX-Thermometer Temperature Sensor by INEX Robotics ($8). (Update: my tests show that its surface is lead-free.)
  • You also need a USB cable to connect the Arduino to your computer, wires, a generic piezo element (as a buzzer) and a resistor. I recommend buying an Arduino starter kit.
  • You may also need a 9V battery and an adaptor to connect it to the Arduino board. You can get a 9V to 2.1mm Barrel Jack for $3.
  • I have used a breadboard for convenience.

The assembly takes less than an hour (see my Picasa album for all the pictures).

  1. Put the LCD shield on the Arduino.
  2. Connect the thermometer to analog port 2 (middle wire). Connect another wire to the 5V and yet another to ground. You need to add a resistor between the 5V and the thermometer.
  3. Connect the piezo to digital 12. Don’t forget to ground one half of it.
  4. Connect the Arduino to your computer by USB. Upload the program using the Arduino IDE. I have posted the C++ software on GitHub. You can use Linux, Windows or a Mac to connect to the Arduino.
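
If you are curious about what the program looks like, here is a stripped-down sketch of the idea. The sensor and piezo pins follow the steps above, but the LCD pin mapping and the temperature conversion are assumptions you must adapt to your own parts; the actual program (on GitHub) also handles the buttons and plays the melody.

#include <LiquidCrystal.h>

LiquidCrystal lcd(8, 9, 4, 5, 6, 7); // typical pins for an LCD shield (assumed)
const int SENSOR_PIN = 2;            // thermometer on analog port 2 (step 2)
const int PIEZO_PIN = 12;            // piezo on digital 12 (step 3)
const float TARGET = 44.0;           // 112F, for yogourt

float readCelsius() {
    int raw = analogRead(SENSOR_PIN);
    return raw * 0.1; // placeholder conversion: calibrate for your sensor!
}

void setup() {
    lcd.begin(16, 2);
}

void loop() {
    float t = readCelsius();
    lcd.setCursor(0, 0);
    lcd.print(t);
    if (t >= TARGET)
        tone(PIEZO_PIN, 440, 500); // a plain beep instead of the melody
    delay(1000);
}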

Operating the thermometer is easy. Give it some power and the display should come alive. Put the end of the black wire from the thermometer in your liquid. The up and down buttons control the target temperature. When the target is reached, the piezo plays a bad rendition of Frère Jacques. Make sure you disconnect the battery once you are done, to save power. This thermometer should work well for beer and yogourt making.

You can easily customize it by adding timers and several different target temperatures. Instead of a piezo element, you could use a voice synthesizer. Best of all, if the temperature sensor breaks, you only need to replace an $8 component.

Further reading: Arduino and Open hardware.

Disclosure: Though I link to RobotShop products, I am not affiliated with them in any way. The main Arduino web site has a list of Arduino distributors by country.

Code: Source code posted on my blog is available from a github repository.

For your in-memory databases, do you really need an index?

For large data sets on disk, indexes are often essential. However, if your data fits in RAM, indexes are often unnecessary. They may even be harmful.

Consider a table made of 10,000,000 rows and 10 columns. Using normalization, you can replace each value by a 32-bit integer for a total of 381 MB. How long does it take to scan it all?

  • When the data is on disk, it takes 0.5 s to scan the data. To maximize buffering, I have used a memory-mapped file.
  • When the data is in memory, it takes 0.06 s.
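
My timings were done in Java (see below), but the test is easy to reproduce. Here is a C++ sketch of the kind of scan being measured, over a table stored as one packed array of 32-bit integers:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const size_t ROWS = 10000000, COLS = 10;
    // 10,000,000 rows of 10 32-bit values: about 381 MB, row-major.
    std::vector<uint32_t> table(ROWS * COLS, 1);
    auto start = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (size_t i = 0; i < table.size(); ++i)
        sum += table[i]; // full scan, no index required
    auto stop = std::chrono::steady_clock::now();
    std::cout << sum << " in "
              << std::chrono::duration<double>(stop - start).count()
              << " s" << std::endl;
}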

Can you bring the 0.06 s figure down with an index? Maybe. But consider that:

  • Indexes use memory, sometimes forcing you to store more data on disk.
  • Indexes typically slow down construction and updates.
  • Indexes typically only improve the performance of some queries. This is especially true with multidimensional data.

Verify my results: My Java code is available on paste.bin. I ran the tests on an iMac with an Intel Core i7 (2.8 GHz) processor.

Source: This blog post was motivated by a question from Julian Hyde, of Mondrian fame.

Further reading: Understanding what makes database indexes work

Code: Source code posted on my blog is available from a github repository.

Who will need database administrators in 2020?

In response to my post Why do we need database joins?, many readers stressed the importance of strict database schemas in preserving data integrity. In short, we want database administrators (DBAs) to input constraints at design time so that the integrity of the database is ensured no matter how lousy your programmers are. There is nothing a DBA hates more than having to recover a database from the backup tapes. And several businesses simply cannot afford to have their databases disrupted.

And really, businesses can be careless with their data. I have repeatedly seen businesses and public organizations hire students or recent graduates for 3 months so that they can add a feature or quickly build a web application. Every single time, I cringe. And almost every single time it ends badly. Managers shouldn’t mess with software and data without real professionals. Alas, the bad ones often do. In such a context, we need DBAs to protect the data, to keep the useful applications running.

So, unsurprisingly, we keep hearing that NoSQL databases fail to get traction in large organizations. Is this a surprise? NoSQL tends to do away with schemas. It gives a lot more power to the programmer. More power to screw up as well. So, we are getting at the heart of the matter. NoSQL is not meant for DBAs. In fact, it is a coup against DBAs:

NoSQL is for programmers. This is a developer-led coup. The response to a database problem can’t always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc.; programmers would prefer a system that they can make work for themselves.

In effect, NoSQL developers are working to make DBAs less relevant (or even irrelevant). And, if history is our guide, they will succeed. I wouldn’t be surprised if in ten years, we declared database administration to be an obsolete occupation. The software will protect the data better than any DBA ever could. This revolution, however, cannot come from the vendors who sell to DBAs. We must have disruptive innovation. And this is exactly what NoSQL is.

Note: I have nothing against DBAs. I expect to be obsolete myself by 2020. (And hopefully, I’ll find a way to retire early.)

Over-normalization is bad for you

I took a real beating with my previous post, where I argued against excessive normalization on the grounds that it increases complexity and inflexibility, and thus makes application design more difficult. Whenever people get angry enough to post comments on one of my posts, I conclude that I am onto something. So, let’s go at it again.

On the physical side, developers use normalization to avoid storing redundant data. While this might be adequate with current database systems, I do not think it is well founded, in principle. Consider this Java code:

String x1 = "John Smith";
String x2 = "Lucy Smith";
String x3 = "John Smith"; // same literal as x1: the JVM interns it and stores it once
Is this code inefficient? Won’t the Java compiler create 3 strings whereas only 2 are required? Not at all: Java is smart enough to recognize that it needs to store only 2 strings. Likewise, there is no reason for this non-normalized table to be inefficient storage-wise, even though Jane Wright appears twice:

Customer ID | First Name | Surname   | Telephone Number
123         | Robert     | Ingram    | 555-861-2025
456         | Jane       | Wright    | 555-403-1659
456         | Jane       | Wright    | 555-776-4100
789         | Maria      | Fernandez | 555-808-9633

Nevertheless, the First normal form article on Wikipedia suggests normalizing the data into two tables:

Customer ID | Telephone Number
123         | 555-861-2025
456         | 555-403-1659
456         | 555-776-4100
789         | 555-808-9633

Customer ID | First Name | Surname
123         | Robert     | Ingram
456         | Jane       | Wright
789         | Maria      | Fernandez

Pros of the normalized version:

  • It does look like the normalized version uses less storage. However, the database engine could compress the non-normalized version so that both use the same space (in theory).
  • We can enforce the constraint that a customer can have a single name by requiring that the Customer ID is a unique key (in the second table). The same constraint can be enforced on the non-normalized table, but less elegantly.

Pros of the non-normalized version:

  • A customer could have different names depending on the phone number. For example, Edward Buttress could use his real name for his work phone number, but he could report the name Ed Butt for his home phone number. The power to achieve this is entirely in the hands of the software developer: there is no need to change the schema to add such a feature.
  • If you start with existing data or try to merge user accounts, the non-normalized version might be the only sane possibility.
  • The database schema is simpler. We have a single table! You can understand the data just by reading it. Most of your queries will be shorter and more readable (no join!).

To a large extent, it seems to me that the question of whether to normalize is similar to the debate of static versus dynamic typing. It is also related to the debate between the proponents of the waterfall method and the agile crowd. Some might say that working with a non-normalized table is cowboy coding. In any case, there is a trade-off: flexibility versus safety. But to my knowledge, the trade-off is largely undocumented. What are the opportunity costs of complex database schemas? How many databases actually become unusable for lack of normalization?

I think that part of the NoSQL appeal for many developers is strong data independence. Having to redesign your schema to add a feature to your application is painful. It may even kill the feature in question because the cost of trying it out is too high. Normalization makes constraints easier, but it also reduces flexibility. We should, at least, be aware of this trade-off.

Note: Yes, my example goes against the current practice and what is taught in all textbooks. But that is precisely my intent.

Update: Database views can achieve a related level of data independence.

Why do we need database joins?

In a recent post, I argued that the current NoSQL trend could be called NoJoin. My argument boils down to the fact that SQL entices you to normalize your data which creates complicated schemas. Meanwhile, NoSQL database systems use simple schemas and are therefore easier to scale out.

Curt Monash has a reasonable post where he points out that we need joins because we normalize. Furthermore, he offers reasons for normalization:

  • To simplify the programming of the updates. Simply put, if the string “Montreal” appears once in your database, and the city changes its name, it is trivial to do the update. This applies mostly when you have complex schemas.
  • For faster updates. Updating a single entry in a database is much faster than searching and updating for all occurrences of the value “Montreal”. This is mostly applicable when you have large update volumes.

However, the case against joins is also strong:

  • Normalization makes your schemas complex. I have seen university databases made of hundreds of tables. The average query is well over 256 characters and involves dozens of joins. It is simply impossible to make sense of the content of any one table. Building new applications on top of this mess is expensive and bug prone. Complexity is bad for your health.
  • Database engines can compress the data automagically so normalization to save space is a waste of time.

The dogma of normalization too often leads to over-engineering. We are so afraid that a programming error could leave the database in a wrongful state that we invest massively in inflexible schemas. In turn, this over-engineering comes back to haunt us when we need to be more agile, or to scale out.

Example:

Suppose you want to design a database of research papers. Let us simplify the problem by omitting the paper identifiers, the dates, and so on. Let us also assume that there is only one author per paper. Maybe your main table looks like this:

authorID  | author name      | publisher | title
smith01   | John Smith       | Springer  | Databases are bad
lampron01 | Nathalie Lampron | IEEE      | The other guy is wrong, databases are good

Being helpful, your friendly database expert points out that your database schema is not even in second normal form. Clearly, you are an amateur. So he creates a secondary table which maps the authorID field to an author name. And voilà! You have saved storage and you won’t ever get someone’s name wrong. Updates to someone’s name will be much faster in the future.

But wait?!? What if Nathalie gets married and changes her name? Indeed, people change their names all the time. Yet we never retroactively change the names of the authors on a paper. Maybe you never thought about it, but many women hold two or more names in their lifetime. Did the guys in IT know about this? (As an aside, are digital librarians worried at all about researchers changing names and seeing their publication lists cut in half? Yes: see the update below.)

My point is that normalization effectively enforces dependencies decided upon when you created the schema. These envisioned dependencies break down all the time. Life is complicated. I could come up with hundreds of examples. Strict normalization makes as much sense as the waterfall model.

What about the physical layer? Because normalization has removed entire fields from the main table, you might think that normalization will save storage! That may well be true in the database engine you are using. However, other database engines could automatically detect the dependencies and compress the data accordingly. In this case, it is trivial to discover that there is a bijective (1-to-1) mapping between author ID and author name. And if the bijectivity breaks down, the database engine simply has to work a bit harder to compress the data. Your code won’t break. It won’t need to be retested. (To be fair, I don’t know if any database system gets this right.)
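
Detecting such a dependency is not even hard. Here is a sketch of a one-pass check an engine could run before deciding to compress away the author names (the data comes from the example above):

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Check whether the "name" column is functionally dependent on the
// "id" column: every id maps to exactly one name.
bool dependent(const std::vector<std::string>& id,
               const std::vector<std::string>& name) {
    std::unordered_map<std::string, std::string> seen;
    for (size_t i = 0; i < id.size(); ++i) {
        auto it = seen.find(id[i]);
        if (it == seen.end())
            seen.emplace(id[i], name[i]);
        else if (it->second != name[i])
            return false; // same id, two names: dependency broken
    }
    return true; // name is determined by id: safe to compress
}

int main() {
    std::vector<std::string> ids = {"smith01", "lampron01", "smith01"};
    std::vector<std::string> names = {"John Smith", "Nathalie Lampron",
                                      "John Smith"};
    std::cout << (dependent(ids, names) ? "compressible" : "not") << std::endl;
}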

Update: Apparently, Otfried Cheong—a Computer Science professor in Korea—once published as Otfried Schwarzkopf. At least, the two names are merged on DBLP. It suggests that DBLP can cope with researchers changing their name.

Why you may not like your job, even though everyone envies you

In a provocative post, Matt Welsh, a successful tenured professor at Harvard, announced that he was leaving his academic job for an industry position. It created a serious malaise: his department chair (Michael Mitzenmacher) wrote a counterpoint answering the improbable question: “why I’m staying at Harvard?” To my knowledge, it was the first time a department chair from a prestigious university answered such a question publicly. Michael even went as far as arguing that, yes, indeed, he could get a job elsewhere. These questions are crazy if we consider that for every job advertised at Harvard, there are probably hundreds of highly qualified applicants.

But let me get back to Matt’s reasons for leaving a comfortable and prestigious job at Harvard:

(…) all of that extra work only takes away from time spent building systems, which is what I really want to be doing. (…) At Google, I have a much more direct route from idea to execution to impact. I can just sit down and write the code and deploy the system, on more machines than I will ever have access to at a university. I personally find this far more satisfying than the elaborate academic process.

In other words, Matt is happier when his work is more immediately useful. But where does the malaise about his decision come from? After all, he will probably make as much or even more money at Google. Matt is not alone, by the way: Matthew Crawford, a Ph.D. in philosophy, left a high-paying job in an American think tank for a job repairing motorbikes. His book Shop Class as Soulcraft tells his story.

I think that Matt’s decision might be hard to understand (at least, his department chair feels the need to explain it to us) because it puts into question the very core values of our society. These core values were explored by Veblen in his unconventional book The Theory of the Leisure Class. He argued that we are not driven by utility, but rather by social status. In fact, our society pushes us to seek high-prestige jobs rather than useful and productive jobs. In effect, a job doing research in Computer Science is more prestigious than an industry job building real systems, on the mere account that it is less immediately useful. Here are some other examples:

  • The electrician who comes and wires your house has a less prestigious job than the electrical engineer who manages vague projects within a large organization.
  • The programmer who outputs useful software has a less prestigious job than the software engineer who runs software projects producing software that nobody will ever use.
  • The scientist who tinkers in his laboratory has a less prestigious job than the scientist who spends most of his time applying for research grants.

Note how money is not always immediately relevant. While manual labor normally has lower pay, this is almost irrelevant. And indeed, plumbers make more than software developers in some parts of the world (like Montreal), even though software jobs are usually considered more desirable.

There are at least three problems with this social-status system:

  • Nature is the best teacher. Working on real problems makes you smart. The German philosopher Heidegger famously made this point with a hammer. To paraphrase him, it is not by staring at a hammer that we learn about hammers. Similarly, scientists who do nothing but abstract work in the context of funding applications are missing out. The best scientists work in the laboratory, in the field; they tinker.
  • By removing ourselves from the world, we risk becoming alienated. We become strangers to the world around us. Instead, we construct an incoherent virtual reality which often has much in common with Soviet-era industrialism. We must constantly remain vague because truth has become subjective. Whereas the hammer hits, the software crashes and the experiment fails, projects are always successful, marketing is always right and truth is arrived at by consensus. Yet we know deep down that this virtual reality is unreal, and we remain uneasy, trapped between reality and virtuality. The perfect example is the financial markets, which create abstract products with agreed-upon values. As long as everyone plays along, the system works. Nobody must ever say that the emperor is naked. Everyone must accept the lies. Everything becomes gray.
  • Human beings like to make their own stuff. We value considerably more what we made ourselves. You may be able to buy a computer for $200, but nothing will ever replace the computer you built yourself from scratch. It may be more economical to have programmers in India build your in-house software, but the satisfaction of building your own software is far greater than what you get by merely funding it. Repairing your own house is a lot more satisfying than hiring handymen.

To summarize: trading practical work for high-level positions is prestigious, but it may make you dumber, alienated and unhappy. Back when I was a graduate student, we used to joke about “the accident”. The accident is what happens to successful professors: they suddenly become uninteresting, pompous and, frankly, a tad stupid.

Thankfully, there is hope. The current financial crisis, mostly because it could not happen according to most economists, was a wake-up call. The abstract thinkers may not be so reliable after all! The millions of college graduates who are underemployed in wealthy countries around the globe have unanswered questions. Weren’t these high-level abstract college degrees supposed to pay for themselves?

How do we fix this broken caste system and bring back a healthier relationship with work? Alas, we cannot all become plumbers and electricians. But it seems to me that more and more people are realizing that the current system, with its neat white-collar jobs and rising inequalities, could be improved upon drastically. The Do-it-yourself (or Do-it-with-others) wave has been a revelation for me. Yes, Chinese factories can build digital thermometers much more cheaply than I can. But making your own digital thermometer is far more satisfying. Saving money by abstracting reality away is not a good deal. And of course, building real systems is not the same as finding money so that your students can do it for you.

Further reading: Working long hours is stupid and Formal definitions are less useful than you think.

You probably misunderstand XML

When I took my current position, I was invited to teach a course on unstructured data. It is a sensible topic for a course: some say that between 80% and 90% of all enterprise data is unstructured. But I objected to the title for marketing reasons. How many students would take a course on unstructured data? I can hear the students asking “what’s that course about?” Thus, I proposed a better title for the course: information retrieval and filtering. Indeed, everyone wants to filter and retrieve data, right?

Meanwhile, there were already courses on structured data (that is, on databases and information systems). However, there was no course on semi-structured data. So I proposed one. But I couldn’t call it “semi-structured data”, as hardly any student would know what the title meant. Instead, I proposed a course which, roughly translated, is called “Information Management with XML.”

Immediately, I got into trouble: how could I dare omit SOAP and web services from a course on XML? I was annoyed by these comments. With some sense of irony, I decided to dump on my students some SOAP examples so that they could see the “beauty” [I am being ironic] of using XML for data exchange on the web. So, there I was, trying to teach my students about semi-structured data, and I was being asked to tell them about remote procedure calls, an irrelevant topic for my purposes.

Thankfully, it appears that history is on my side. Developers got tired of receiving these bulky XML payloads. In time, they started using JSON, a much more appropriate format for passing small loads of structured data between a server and an ECMAScript client. It uses fewer bytes and, more importantly, it is typically an order of magnitude faster to parse. When you ask on Stack Overflow whether you should be using SOAP, you are told to avoid it at all costs. The developers have spoken. And as a result, the organization behind the SOAP stack decided to close shop.

Where does that leave XML? Precisely where it started. XML is a great example of how to deal with semi-structured data. And it is just as useful as ever. Want to deal with documents? DocBook and OpenDocument are great formats. Want to add semantic information to web pages? Microformats can do it. You want to exchange complex business data? The Universal Business Language probably does what you need. Some people are having luck with the SVG image format. You want to subscribe to my blog? Grab my Atom feed. For these applications, you couldn’t easily replace XML with flat files or JSON. Nor should you try.

Alas, we ended up torturing XML by applying it to ill-suited purposes. We must learn to select the best format. Does your data look like a table? Can a flat file do the job? Do you need a key-value format like JSON? Or maybe a simple text file? Or is your data more like an XML document? Take a good look at your data before picking a format for it.
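
As a made-up illustration, a phone-book entry is naturally key-value:

{ "name": "Jane Wright", "phone": "555-403-1659" }

whereas a document with mixed content, where markup and free text interleave, is naturally XML:

<p>XML shines when <emphasis>markup and text interleave</emphasis> inside a document.</p>

Flattening the first into XML adds noise; flattening the second into JSON destroys the structure.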

Further reading: Indexing XML and Native XML databases: have they taken the world over yet?

Update: I don’t include configuration files in my list of proper XML applications.

How do search engines handle special characters? Should you care?

Matt Cutts is Google’s search engine optimization expert. He runs a great YouTube channel called Google Webmaster Central. He was recently asked how Google handles special characters such as ligatures, soft hyphens, interpuncts and hyphenation points. His answer? He doesn’t know.

Being a scientist, I decided to compare Google to the young upstart (Bing). On Google, the result sets for Kurt Goedel differ from those for Kurt Gödel. For example, I could only find the Kurt Goedel article from Uncyclopedia when searching for Kurt Goedel. Similarly, Google fails to realize that cœur and coeur are the same word. However, Bing knows that Goedel and Gödel are the same person. Bing knows that cœur and coeur are the same word.

While the consequences are small, they are nevertheless real:

  • Students may fail to find great references on Kurt Gödel because they search for Kurt Goedel. Indeed, most academic papers seem to prefer Gödel to Goedel.
  • A writer who tries to be typographically correct and writes cœur may get penalized when people search for coeur.

Score: Bing 1. Google 0.

Source: Will Fitzgerald. Special thanks to Mark Reid, Marek Krajewski, Jeff Erickson and Christer Ericson for the debate on Twitter motivating this blog post.

Who is going to need a database engine in 2020?

Given the Big Data phenomenon, you might think that everyone is becoming a database engineer. Unfortunately, writing a database engine is hard:

  • Concurrency is difficult. Whenever a data structure is modified by different processes or threads, it can end up in an inconsistent state. Database engines cope with concurrency in different ways: e.g., through locking or multiversion concurrency control. While these techniques are well known, few programmers have had a chance to master them.
  • Persistence is also difficult. You must somehow keep the database on a slow disk, while keeping some of the data in RAM. At all times, the content of the disk should be consistent. Moreover, you must avoid data loss as much as possible.
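
As a tiny illustration of the first point, here is the classic lost-update race in C++, with the atomic repair next to it; real engines face much harder versions of this problem:

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<long> safe(0);
    long racy = 0;
    auto work = [&] {
        for (int i = 0; i < 1000000; ++i) {
            safe.fetch_add(1); // atomic read-modify-write
            ++racy;            // data race: increments can be lost
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // "safe" is exactly 2000000; "racy" is usually less.
    std::cout << safe.load() << " vs " << racy << std::endl;
}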

So, developers almost never write their own custom engines. Some might say that this is an improvement over earlier times, when developers absolutely had to craft everything by hand, down to the B-trees. The result was often expensive projects with buggy results.

However, consider that even a bare-metal language like C++ is getting support for concurrency and threads, as well as esoteric features like regular expressions. Moreover, Oracle’s hard work at killing the Java Community Process will incite Java developers to migrate to better languages.

Meanwhile, in-memory databases are finally practical and inexpensive. Indeed, whereas a 16 GB in-memory database was insane ten years ago, you can order a desktop with 32 GB of RAM from Apple’s web site right now. Moreover, memory capacity grows exponentially: Apple will sell desktops with 1 TB of RAM in 2020. And researchers predict that non-volatile resistive RAM (RRAM) may replace DRAM. Non-volatile internal memory would make persistence much easier.

But why would you ever want to write your own database engine?

  • For speed, some engines force you to use nasty things like stored procedures. It is a drastically limited programming model.
  • The mismatch between how the programmer thinks and how the database engine works can lead to massive overhead. As crazy as it sounds, I can see a day when writing your engine will save time. Or, at least, save headaches.
  • Clever programmers can write much faster specialized engines.

Obviously, programmers will need help. They will need great libraries for data processing, data compression and data architecture. Oh! And they will need powerful programming languages.

The future is already here: it’s just not very evenly distributed

It is not 9am yet. Nevertheless, I got a lot done:

  • I attended the thesis proposal of my student Eduardo via Skype. I was literally in my basement with a fresh cup of coffee, attending a presentation hundreds of kilometers away. Besides myself, professors from two different cities were attending.
  • I watched a recorded concert of the latest Japanese rockstar, Hatsune Miku, on YouTube. Oh! And she is a synthetic character.
  • I read my morning news on my Star Trek-like iPad. I received my news from a hand-picked lot of 50 or so people worldwide. It has been months since I last read a newspaper.

Though I tried to convince my wife that the future has finally arrived, she insists that we always live in the present.

Well. Some people certainly insist on living in the past:

  • Canada’s largest science funding agency (NSERC) sent me a funding proposal to review. They had to send it to me on paper, by mail. The file has dozens of pages. I had to carry it home. I will have to find a way to recycle it after shredding it. What is wrong with giving me access to a PDF or HTML copy of the proposal? Can’t the person in charge of printing the proposal find another occupation?
  • Even though I have not had to look up a research article on physical paper in nearly a decade, a leading journal asked me to include page numbers in all my references. But these page numbers only make sense if you are holding the physical copy of the journal. Why would anyone look up articles from the physical copy when they can find and download research papers in seconds? I’m even told that graduate students in several schools are trained in how to use a (physical) library to look up references. Oh! The pain! Why do librarians insist on teaching obsolete skills? There is no shortage of useful skills a librarian could teach students in 2010. Indeed, I am amazed at how little most graduate students know about tracking research papers online.
  • My employer still sends me, every two weeks, a little envelope with a single sheet of paper in it, which tells me how much money they put into my bank account. Of course, I know exactly what this amount is, as I can see it through online banking. So what is the point of this piece of paper? To create employment at a local printing press? Tracking pieces of paper is time consuming. Worse: the information is incomplete, even incomprehensible at times, because of the lack of physical space on the paper. Why can’t Human Resources set up a web site where I could get all the information I need to understand all the mysterious deductions appearing on my pay stub? Why must they insist on printing useless pieces of paper? Why can’t they provide useful and modern tools instead?

Source: The title of my post is a quote from William Gibson.

Can you trust fixed-bit computer arithmetic?

Suppose that you have 10 pictures, and all lined up, they take 100 pixels. Is it safe to say that each picture has a width of x pixels if 10 x = 100?

We all know that a x = b has a unique solution x as long as a is non-zero. If you work with integers, then you can say that there is at most one solution.

Unfortunately, computer arithmetic is uglier. Over 32-bit integers, the equation a x = 0 has 2^31 solutions if a = 2^31. Indeed, as long as x is even, 2^31 x = 0.

But notice how I had to choose a large value of a to make computer arithmetic look bad. What if a is small? Certainly, if a = 1, then there is exactly one solution. If a = 2, then there are at most two solutions: 2 x can take any even value, and the most significant bit of x is discarded.

We can generalize this result a bit: if a = 2^L, then there are at most 2^L solutions x.

But we can generalize it even further! The equation a x = b has at most 2^L integer solutions x if a is an L-bit non-zero integer. That is, as long as a is small, a x = b has few solutions. The proof is technical, but not difficult:

Proof (Sketch). Let a be an L-bit non-zero integer and suppose we work with K-bit arithmetic. We consider the equation a x div 2^L = b modulo 2^(K-L). Let j be the position of the first non-zero bit of a; e.g., if a = 7 then j = 1. Because a_i = 0 for i < j, the value of a x div 2^L is independent of the last j-1 bits of x. Moreover, the most significant bit of x that matters, x_(K-j+1), affects only the last bit of a x. We can solve for this last effectual bit x_(K-j+1) in (a x)_K = b_(K-L), as a function of b_(K-L), a and the lesser bits of x. We can continue solving for the lesser bits of x by considering a x = b minus the last bit. (In effect, we reduce the problem in a K-bit integer ring to a similar problem in a (K-1)-bit integer ring.) After K-L steps, the process terminates (K = L), leaving L bits of x that can be set freely, generating 2^L different solutions. QED

Getting back to our original problem: because the integer 10 can be written using 4 bits (0b1010), the equation 10 x = 100 has at most 2^4 = 16 solutions. So if x were just picked at random, chances are good that the assertion would fail: for 32-bit integers, the probability of a false positive is at most 15/2^32, which is much, much less than 1%.
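
You can verify the bound exhaustively. Here is a quick check in 16-bit arithmetic, where the same bound (at most 2^4 = 16 solutions) applies and the whole search space fits in a loop:

#include <cstdint>
#include <iostream>

int main() {
    // Count the solutions of 10 x = 100 over 16-bit integers.
    int count = 0;
    for (uint32_t x = 0; x < 65536; ++x)
        if (static_cast<uint16_t>(10 * x) == 100)
            ++count;
    std::cout << count << " solutions" << std::endl; // prints 2
}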

Further reading: M. Dietzfelbinger, Universal hashing and k-wise independent random variables via integer arithmetic without primes, in: STACS’96, 1996.

Write a Twitter application in 5 minutes

I spend much time alone, writing and thinking. Twitter helps me stay connected. I love the platform.

On Friday, I wanted to find the intersection of the users followed by any two individuals. Indeed, suppose that you like both Joe and Jill, and they have similar interests. Maybe whoever they both follow is also interesting? I could not find a tool to do it, so I built one with python-twitter.

Anybody with a working knowledge of Python can do it in less than 5 minutes. I used only twenty lines of code (in total!!!). The code proved immediately useful.

If you do not know Python or Ruby, learn one or the other. Tonight. It is powerful stuff.

Update: A few days after I wrote this post, Twitter made my script obsolete by changing its authentication requirements.

Working long hours is stupid

We do too much. We carry too many projects. This overproduction creates problems which we try to fix by working even more.

We value most what we create (see Made by Hand and The Upside of Irrationality). To be happy, you want to focus on making interesting stuff. This takes time and dedication. Yet, as Graham’s essay The Top Idea in Your Mind stresses, we often fall into the trap of thinking mostly about money and personal disputes. These thoughts pull us away from our interests and prevent us from doing great work. As an example, I hear that Tiger Woods isn’t playing great golf. I bet he is stuck on money problems or personal disputes, or both.

It is hard to become overworked by writing a book, writing research articles or playing golf. People are overworked from dealing with email, context switching, money and touchy relationships. This abundance of busywork makes people sad and boring. And this type of work tends to reproduce itself: the more you have, the more you will have.

Unemployment and pollution are visible results of our overproduction. Yet there are many more negative side effects. In academia, we train more and more Ph.D.s every year, even though there have been too many Ph.D.s on the job market since the seventies. We write more and more research papers every year, and spend more and more time applying for research grants… but professors spend less and less time on curiosity-driven research.

It is cool to produce great work, but it is not cool to work 60 hours a week unless it is out of passion. And nobody is passionate about grant applications, marking papers or handling difficult people. Moreover, working long hours does not scale: you cannot increase your output indefinitely.

Our productivity will keep improving. I can write software faster and better than ever. I can survey prior work with ease. I can ask fancy mathematical questions on the Web and get answers in minutes. Instead of investing this productivity back into more silly work, we need to get smarter:

  • Focus on the essential: programming great software, writing a fun book, a set of inspiring lecture notes or an insightful article.
  • Automate, reduce or delegate. Reduce is best: doing fewer things is cool!
  • A focus on money or on personal disputes makes you stupid. Yet that’s where success often takes you. Watch out!
  • Airplanes pollute. Travel takes you away from your family. Cars pollute and make you fat. Do you need all that junk?

Is multiplication slower than addition?

Earlier, I asked whether integer addition was faster than bitwise exclusive or. My tests showed no difference, and nobody contradicted me.

However, everyone knows that multiplication is slower than addition, right? In cryptography, there are many papers on how to trade multiplications for additions to speed up software.

So? Can you predict which piece of code runs faster?

scalar product (N multiplications):

for (int k = 0; k < N; ++k)
    answer += vector1[k] * vector2[k];

scalar product two-by-two (N multiplications):

for (int k = 0; k < N; k += 2)
    answer += vector1[k] * vector2[k]
            + vector1[k + 1] * vector2[k + 1];

non-standard scalar product (N/2 multiplications):

for (int k = 0; k < N; k += 2)
    answer += (vector1[k] + vector2[k])
            * (vector1[k + 1] + vector2[k + 1]);

just additions (no multiplication):

for (int k = 0; k < N; ++k)
    answer += vector1[k] + vector2[k];

Answer: Merely reducing the number of multiplications has no benefit, in these tests. Hence, simple computational cost models (such as counting the number of multiplications) may not hold on modern superscalar processors.

My results using GNU GCC 4.2.1 on both a desktop and a laptop:

algorithm             | Intel Core i7 | Intel Core 2 Duo
scalar product        | 0.30          | 0.39
scalar product (2×2)  | 0.25          | 0.39
fewer multiplications | 0.25          | 0.39
just additions        | 0.16          | 0.23

Times are in seconds. The source code is available without pointer arithmetic. The same test with pointer arithmetic gives faster results, but the same conclusion. I tried a similar experiment in Java; it confirms my result.

Code: Source code posted on my blog is available from a github repository.