Social Web or Tempo Web?

Back in 2004, Tim O’Reilly observed that the Web had changed, and popularized the term Web 2.0. This new Web is made of several layers which enable the Social Web. Wikipedia and Facebook are defining examples of the Social Web.

This sudden discovery of the Social Web feels wrong to me. In the early nineties, I was an active user of Bulletin Board Systems (BBS). While they were not the Web, or even part of the Internet, BBSes were clearly a social medium. You know the multi-user games people play on Facebook? We had that back in 1990. The graphics were poorer, obviously, but it was all about meeting people.

The barrier to entry keeps getting lower, to the point where even grandfathers are now on Facebook. But the Web has hardly been limited to an elite. Even BBSes were quite democratic: retired teachers would chat with young hackers all the time. It is the extremely low cost of computers and their ubiquity which makes the Social Web so widespread.

A much more interesting change has received less notice: the tempo of the Web is changing. Geocities made it easy for anyone to create a home page. But updating your home page was a slow process. In effect, our mental model of the Web was that of a library, and Web sites were books that could be updated from time to time. Eventually, we gave up on this model and decided to view the Web as a data stream. This realization changed everything.

The pace used to range from static web pages to flaming on posting boards. We have now expanded our temporal range. We can now communicate with high frequency in short bursts. Twitter is one extreme: it is akin to techno music. Facebook is somewhat slower, and more elaborate, maybe like rock. Posting research articles is no music at all: it is akin to the rhythm of the Earth around the Sun. These tools don’t just differ in the frequency of the updates, but also in their volume, and in the length of the pauses.

Maybe we should try to understand the Web by analogy with music. How does the Web sound to you, today?

Further reading: See my blog post Is Map-Reduce obsolete? Also, be sure to check Venkatesh Rao’s blog. Rao has a new book which I will review in the future.

Not even eventually consistent

Many database engines ensure consistency: at any given time, the database state is logically consistent. For example, even if you receive purchase requests by the thousands, you will always have an accurate count of how many products you have sold, and how many remain in stock. Accountants are especially concerned with consistency: they have invented techniques such as double-entry bookkeeping to favor consistency. Inconsistencies cause problems:

  • The user bought a product, a purchase record has been added, but the user account has not yet been charged. This could allow a user to buy more than he can afford.
  • All items in stock have been sold, but a customer is being told that a few items remain in stock. Thus, a vendor could make sales it cannot fulfill.

Yet, in practice, requiring consistency means that your system will become unavailable from time to time. Thus, many NoSQL databases have adopted an eventual consistency approach. That is, while at any point in time the system might be logically inconsistent, it will eventually recover:

  • The user account will be charged prior to the delivery of the product.
  • A customer who ordered a product that is no longer in stock will be told prior to finishing his order.
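
As a rough illustration of the idea, here is a toy Java sketch (not how any particular NoSQL database works): purchases are recorded immediately, while account balances are only brought up to date by a later reconciliation pass. The class and method names are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of eventual consistency: purchases are recorded at once,
// but balances are only updated when reconcile() runs.
class EventualLedger {
    // A charge that has been recorded but not yet applied.
    private static final class Charge {
        final String user;
        final int amount;
        Charge(String user, int amount) { this.user = user; this.amount = amount; }
    }

    private final Map<String, Integer> balances = new HashMap<>();
    private final Queue<Charge> pending = new ArrayDeque<>();

    void deposit(String user, int amount) {
        balances.merge(user, amount, Integer::sum);
    }

    // Record the purchase without touching the balance: the system is
    // temporarily inconsistent until reconcile() is called.
    void purchase(String user, int amount) {
        pending.add(new Charge(user, amount));
    }

    // Apply all pending charges; afterwards balances reflect every purchase.
    void reconcile() {
        Charge c;
        while ((c = pending.poll()) != null) {
            balances.merge(c.user, -c.amount, Integer::sum);
        }
    }

    int balance(String user) {
        return balances.getOrDefault(user, 0);
    }

    boolean isConsistent() {
        return pending.isEmpty();
    }

    public static void main(String[] args) {
        EventualLedger ledger = new EventualLedger();
        ledger.deposit("alice", 100);
        ledger.purchase("alice", 30);
        // The purchase exists, but the balance has not caught up yet.
        System.out.println(ledger.balance("alice") + " consistent=" + ledger.isConsistent());
        ledger.reconcile();
        System.out.println(ledger.balance("alice") + " consistent=" + ledger.isConsistent());
    }
}
```

Between purchase() and reconcile(), a reader of the balance sees stale data; the design bets that this window is short and that the business can tolerate it.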

Even though it gives headaches to the developers, eventual consistency is good enough. In any case, robust systems have to deal with exceptions. Even if your systems tell you that there are enough items in stock, it could be that one item was damaged in the warehouse, or that another vendor is willing to quickly provide you with missing items. That is, at best, the data in your database is a logically consistent abstraction. But all non-trivial abstractions, to some degree, are leaky. And that is why we pay accountants: they chase down the leaks.

Maybe we should keep in mind that the largest, most powerful, information system ever designed is logically inconsistent: the Web is full of dangling and misdirected hyperlinks. And it would be extremely hard to debug the Web. Thankfully, we do not need to. My own Web site is probably filled with mistakes. I am sure that I link to pages that no longer exist. Still, it works. The cost of maintaining the integrity would be too high. The errors are acceptable.

As an analogy, security experts rarely try to fully secure systems. Instead, they identify components that must be secured, and they determine the risk tolerance. Complete security is simply too expensive in practice, unless you are willing to live in an isolated bunker.

Thus, maybe the solution will be to accept information systems that are not even eventually consistent. If I run a business, all that matters to me is that my customers are happy, and I am making a healthy profit. Everything else is secondary. I might be willing to tolerate that some customers are never charged for an item they receive. I might be willing to slightly overstock on some items.

Sure, some might describe a not even eventually consistent system as an abomination, something impossible to program for… but the Web would have been described this way in the 1980s. The key is that the Web is not inconsistent in any random way: it has its own viable logic.

Further reading: The CALM Conjecture: Reasoning about Consistency

Book review: Statistical Analysis with R

The programming language R is a standard for statisticians. And it is free software which runs on Windows, Mac and Linux.

You can learn much online about R, but if you prefer a bona fide book, there are also many to choose from. I just finished one of those: Statistical Analysis with R by John M. Quick.

The book is colorful and ludic, which is a good idea for a “Beginner’s Guide”. The layout is attractive and there are many detailed examples. I like the pedagogy: the author wants you to learn by doing. However, the book is poor as a reference: the coverage is limited despite the 300 pages, and the chapter summaries are strictly non-technical.

Overall this is a good book for people with no programming background who just want to use R to load data, do ANOVA tests and simple models. They will find step-by-step instructions down to the installation of the software, complete with screenshots of every step. This could be a good book for scientists or business people who want to try R as a substitute for Excel.

Disclosure: I got a free copy of the e-book from the publisher with the expectation that I would publish a review.

Innovating without permission

Is Open Source software better than closed-source software? Is Wikipedia better than Britannica? Is NoSQL better than Oracle and SQL Server? Are blogs better than corporate news services? Do patents favor innovation? Do long and painful funding applications help science? Is learning off Wikipedia and YouTube as good as learning in a classroom? Is duck typing good or bad?

All these questions are closely related to the concept of Innovation without permission: people innovate more when they do not have to ask permission.

Innovation without permission is analogous to Ridiculously Easy Group-Formation, popularized by Clay Shirky, but first proposed by my colleague Sébastien Paquet in 2002. Paquet’s message is that we are pushing the threshold for group creation to an unprecedented low. In 2011, it is hard to disagree with Paquet: entire political movements come online together automatically.

Similarly, we are lowering the barriers to innovation to unprecedented lows:

  • Open Source developers are probably no better than closed-source developers, but they typically produce their work without asking permission. Moreover, Open Source has gotten much easier over the years. Starting a project was relatively difficult in the nineties. Then SourceForge came along. Today, we have GitHub, which is even easier. Using GitHub, people I had never heard of, and never approved, submitted patches to some of my code without permission! Fantastic!
  • When the era of personal computers arrived, a small team could publish a game, get it distributed through stores and sold to millions of people. Yet, last year, Markus Persson became a millionaire in a month by publishing an unfinished game as a Java applet from his web site (Minecraft). At each step in this process, the barrier to innovation becomes lower. First, you had to ask permission from the stores. Today, you can innovate and just post your work and, apparently, get paid very well for your effort.
  • Many of the best-selling titles in the Amazon Kindle store come from self-published authors. People write and sell books without asking permission from a publisher! They are also making more money than they used to.
  • Wikipedia contributors and editors are no better than Britannica’s authors, but they mostly work without asking permission.
  • NoSQL databases often allow developers to add new features without having to convince a DBA to change the database schema.
  • Bloggers post without the approval of an editor.
  • Duck Typing allows you to use a function that was meant for a different data type without having to ask for a change in the API.
  • The barriers in scientific research have come down significantly. You can solve problems, post your solutions online and people might offer you a million dollars. It happened to Perelman. Increasingly, access to the literature is Open Access. Anyone can read the papers on PLoS and even contribute comments.

We can identify some barriers to innovation which are also great opportunities:

  • Patents and copyright are first on my list. Much of the copyrighted content is easier to pirate than to license. Unsurprisingly, people do not ask permission and they violate copyright. A possible outcome is that all industries could become like the fashion industry where copyright is largely irrelevant.
  • Funding remains a significant barrier to innovation. The Kickstarter model is interesting in this respect.
  • While I love web applications such as Google Mail, YouTube and Facebook, they won’t let their users redesign the product. You can post innovative content, but you cannot contribute (much) to the architecture of the application. Twitter is maybe the exception where users redefined Twitter by using tags and retweets, both design elements that were not present in the system initially. I am amazed that nobody has thought of creating a web application letting people program and distribute video games online. (If you implement this idea well and become a millionaire, please send me a check.)

Demarchy and probabilistic algorithms

Demarchies are political systems built using randomness. Demarchy was used to designate political leaders in Ancient Greece and in France (during the French Revolution). In many countries, jury selection is random. On Slashdot, the comment moderation system relies on randomly selected members: as you use the system, you increase your (digital) karma, which increases the probability that you will become a moderator. Demarchy has been depicted by several science-fiction authors including Arthur C. Clarke, Kim Stanley Robinson, Ken MacLeod and Alastair Reynolds.
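
To make the karma idea concrete, here is a minimal Java sketch of a weighted lottery: each candidate is drawn with probability proportional to a non-negative weight, and a weight of zero disqualifies the candidate. This is only an illustration; it is not Slashdot's actual moderation algorithm, and the names (KarmaLottery, pick) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Karma-weighted lottery: each candidate is picked with probability
// proportional to a non-negative weight ("karma").
class KarmaLottery {
    static String pick(Map<String, Integer> karma, Random rng) {
        int total = 0;
        for (int k : karma.values()) {
            if (k < 0) throw new IllegalArgumentException("negative karma");
            total += k;
        }
        if (total <= 0) throw new IllegalArgumentException("no eligible candidate");
        int ticket = rng.nextInt(total); // uniform in [0, total)
        for (Map.Entry<String, Integer> e : karma.entrySet()) {
            ticket -= e.getValue();
            if (ticket < 0) return e.getKey();
        }
        throw new IllegalStateException("unreachable");
    }

    public static void main(String[] args) {
        Map<String, Integer> karma = new LinkedHashMap<>();
        karma.put("alice", 5); // active participant
        karma.put("bob", 1);
        karma.put("carol", 0); // disqualified: can never be drawn
        Random rng = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(pick(karma, rng));
        }
    }
}
```

With these weights, alice should be drawn about five times as often as bob over many draws.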

Democracy is deterministic: we send the representative with the most votes. Demarchy is probabilistic: we need a source of random numbers to operate it. Unfortunately, probabilities are less intuitive than simple statistical measures like the most frequent item. It is also much more difficult to implement demarchy, at least without a computer. Meanwhile, most software uses probabilistic algorithms:

  • Databases use hash tables to recover values in expected constant time;
  • Cryptography is almost entirely about probabilities;
  • Most of Machine Learning is based on probabilistic models.

In fact, probabilistic algorithms often work better than deterministic algorithms.

Some common objections to demarchy:

  • Not everyone is equally qualified to run a country or a city. My counter-argument: Demarchy does not imply that everyone has an equal chance of being among the leaders. Just like in a jury selection, some people are disqualified. Moreover, just like Slashdot, there is no reason to believe that everyone would get an equal chance. Active participants in the political life might receive higher “digital karma”. Of course, from time to time, incompetent leaders would be selected, by random chance. But we also get incompetent leaders in our democratic system.
  • There is too much to learn: the average Joe would be unable to cope. My counter-argument: Demarchy does not imply that a dictator for life is nominated, or that the leaders can make up all the rules. In the justice system, jury duty covers a short period of time and a narrow scope (one case), is closely supervised by an arbitrator (the judge), and typically involves several nominees (a dozen in Canada). And if qualifications are what matters, then why don’t we elect leaders based on their objective results on some tests? The truth is that many great leaders did not do so well on tests. Most notably, Churchill was terrible at Mathematics. Moreover, experts such as professional politicians are often overrated: it has been claimed that picking stocks at random is better than hiring an expert.
  • Demarchy nominees would be less moral than democratically elected leaders. My counter-argument: I think the opposite would happen. Democratic leaders often feel entitled to their special status. After all, they worked hard for it. Studies show that white collar criminals are not less moral than the rest of us, they just think that the rules don’t apply to them because they are special. Politics often attracts highly ambitious (and highly entitled) people. Meanwhile, demarchy leaders would be under no illusion as to their status. Most people would see the nomination as a duty to fulfill. It is much easier for lobbyists to target professional politicians than short-lived demarchy nominees.

Further reading: Brian Martin’s writings on demarchy and democracy and How to choose? by Michael Schulson.

Our institutions are limited by the pre-digital technology

Many of our institutions are limited by pre-digital technology: (1) it is difficult to constantly re-edit a paper book; (2) without computers, global trade requires perennial currencies; (3) without information technology, any political system more fluid than representative democracy cannot scale up to millions of individuals. These embedded limitations still constrain us, for now.

  • We teach kids arithmetic and calculus, but systematically fail to teach them about probabilities. We are training them to distinguish truth from falsehoods, when most things are neither true nor false. Meanwhile, sites like Stack Overflow and Math Overflow expose these same kids to rigorous debates, much like those I imagine the Ancient Greeks having. We are exposing the inner workings of scholarship like never before. Instead of receiving knowledge from carefully crafted books and lectures, kids have to be trained to seek out truth on their own. The new generation will expect textbooks to be editable, like Wikipedia. You might be 16, but you can still correct a professor. But the 12-year-old next door can correct you too. Are we still going to have static books, idealized in some “final version”? I think that books and journal articles will have to become dynamic, or they will grow irrelevant.
  • Why do we need central banks and currencies? Because they, alone, can ensure determinism and stability? Why can’t we have several competing currencies? In any case, how wealthy is Mark Zuckerberg? Financial determinism is an illusion. There may be a picture of the Queen on my dollar bills, but the authority of the Monarch does not protect the value of this currency. If not for their fear of upsetting local governments, VISA and Mastercard could create a new currency in an instant. If not for government regulations, Paypal would be competing with central banks. Suppose that I help a game company. Maybe they could pay me with their in-game currency. Many people worldwide would be willing to trade this currency against other goods and services. At no point do I need to print money! We could even automate the process. I get some in-game currency and I want to buy food: can a computer arrange trades so that I get bags of rice delivered to my home? Of course it can. Obviously, governments would start regulating because these sorts of trades effectively cut the government out. But I think it would be very difficult to outlaw or regulate these trades in a context where the financial markets are driven by algorithms.
  • We expect political leaders to represent the interest of an entire population. Have you ever stopped to consider how insane this expectation is? Would you trust a jury made of a single person? We have learned from machine learning that the best results are typically achieved with Ensemble Methods. What we need are sets of leaders, automatically selected and weighted. We don’t want the best leader, we want the best ensemble of leaders. And the best way to avoid biases is to change these leaders frequently. We almost have the technology to do it in a scalable manner. Electronic voting is only the first step toward a new, more fluid, form of politics. If it sounds crazy, consider that people spend more and more time online, where politics is very different than in our brick-and-mortar world. The models emerging online will be implemented in offline politics. That is where the political innovation will occur.

|           | Currency                   | Mathematics             | Politics  | Knowledge                          |
| Primitive | None                       | One, Two or Many        | Tribe     | Stories                            |
| Moderate  | Local trade                | Arithmetic and geometry | Monarchy  | Manuscripts                        |
| Advanced  | Central currencies         | Calculus                | Democracy | Books, Authoritative Science       |
| Upcoming  | Distributed and electronic | Bayesian                | Demarchy  | Open and Participative Scholarship |

Make your own programmable digital thermometer in an hour

I make my own yogourt because I cannot stand commercial yogourt. You can make your own yogourt in less than 30 minutes: heat milk to 112F (44C), mix in a small quantity of yogourt, leave the container of warm milk in a blanket overnight.

To minimize the labor, I wanted a digital thermometer with an alarm set at 112F. Alas, most kitchen digital thermometers are designed for cooking meat. They can take intense heat, but exposure to milk shortens their life considerably.

Inspired by Frauenfelder’s Made by Hand, I decided to make my own programmable digital thermometer.

You can do it in an hour and for less than $80. You do not need any specific knowledge. A kid could do it.

You need to order:

  • An Arduino board ($30). I have used an Arduino Uno.
  • An LCD Keypad Shield ($24). I have used the DFRobot Shield.
  • A ZX-Thermometer Temperature Sensor by INEX Robotics ($8). (Update: my tests show that its surface is lead-free.)
  • You also need a USB cable to connect the Arduino to your computer, wires, a generic piezo element (as a buzzer) and a resistor. I recommend buying an Arduino Starter kit.
  • You may also need a 9V battery and an adaptor to connect it to the Arduino board. You can get a 9V to 2.1mm Barrel Jack for $3.
  • I have used a breadboard for convenience.

The assembly takes less than an hour (see my Picasa album for all the pictures).

  1. Put the LCD shield on the Arduino.
  2. Connect the thermometer to analog port 2 (middle wire). Connect another wire to the 5V pin and yet another to ground. You need to add a resistor between the 5V pin and the thermometer.
  3. Connect the piezo to digital pin 12. Don’t forget to connect its other lead to ground.
  4. Connect the Arduino to your computer by USB. Upload the program using the Arduino IDE. I have posted the C++ software on GitHub. You can use Linux, Windows or a Mac to connect to the Arduino.

Operating the thermometer is easy. Give it some power. The display should come alive. Put the end of the black wire from the thermometer in your liquid. The up and down buttons control the target temperature. When it is reached, the piezo will play a bad rendition of Frère Jacques. Make sure you disconnect the battery once you are done to save power. This thermometer should work well for beer and yogourt making.
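
The behaviour described above can be summarized in a few lines of logic. The following is a plain-Java sketch of that logic only; it is not the Arduino firmware (which is the C++ program mentioned earlier). It assumes a one-degree step per button press and that the alarm fires once the liquid is at or above the target while heating; the class and method names are hypothetical.

```java
// Plain-Java sketch of the alarm logic; not the Arduino firmware.
class ThermometerAlarm {
    private int targetF; // target temperature in degrees Fahrenheit

    ThermometerAlarm(int initialTargetF) {
        this.targetF = initialTargetF;
    }

    // Assumed step: one degree per button press.
    void pressUp() { targetF++; }
    void pressDown() { targetF--; }

    int targetF() { return targetF; }

    // While heating, the alarm should sound once the measured
    // temperature is at or above the target.
    boolean shouldSound(double measuredF) {
        return measuredF >= targetF;
    }

    static double fahrenheitToCelsius(double f) {
        return (f - 32.0) * 5.0 / 9.0;
    }

    public static void main(String[] args) {
        ThermometerAlarm alarm = new ThermometerAlarm(112);
        System.out.println("target in Celsius: " + fahrenheitToCelsius(alarm.targetF()));
        double[] readingsF = {70.0, 95.5, 111.9, 112.3};
        for (double r : readingsF) {
            System.out.println(r + "F -> alarm: " + alarm.shouldSound(r));
        }
    }
}
```

On the real device, the measured value would come from the sensor on analog port 2, and the conversion from the raw reading to a temperature depends on the sensor, so it is left out here.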

You can easily customize it by adding timers and several different target temperatures. Instead of a piezo element, you could use a voice synthesizer. Best of all, if the temperature sensor breaks, you only need to replace an $8 component.

Further reading: Arduino and Open hardware.

Disclosure: Though I link to RobotShop products, I am not affiliated with them in any way. The main Arduino web site has a list of Arduino distributors by country.

Code: Source code posted on my blog is available from a github repository.

For your in-memory databases, do you really need an index?

For large data sets on disk, indexes are often essential. However, if your data fits in RAM, indexes are often unnecessary. They may even be harmful.

Consider a table made of 10,000,000 rows and 10 columns. Using normalization, you can replace each value by a 32-bit integer for a total of 381 MB. How long does it take to scan it all?

  • When the data is on disk, it takes 0.5 s to scan the data. To maximize buffering, I have used a memory-mapped file.
  • When the data is in memory, it takes 0.06 s.
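
To check the arithmetic and get a feel for what a full scan involves, here is a small Java sketch. It is not the benchmark code referred to in this post; the class name, the fill pattern and the reduced table size are assumptions chosen so that it runs with a default heap. Note that the 381 MB figure is in units of 2^20 bytes.

```java
// Back-of-the-envelope size computation and a full scan over a flat int array.
class ScanSketch {
    // Storage needed for rows x cols 32-bit integers, in units of 2^20 bytes.
    static double sizeInMiB(long rows, long cols) {
        return rows * cols * 4 / (1024.0 * 1024.0);
    }

    // Full scan: visit every value and count those equal to target.
    static long countEqual(int[] table, int target) {
        long count = 0;
        for (int value : table) {
            if (value == target) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(sizeInMiB(10_000_000L, 10)); // about 381.47

        // A smaller table than the one in the post (1,000,000 rows).
        int rows = 1_000_000, cols = 10;
        int[] table = new int[rows * cols];
        for (int i = 0; i < table.length; i++) table[i] = i % 7;

        long start = System.nanoTime();
        long hits = countEqual(table, 5);
        long elapsedNanos = System.nanoTime() - start;
        System.out.println(hits + " matches in " + elapsedNanos / 1e6 + " ms");
    }
}
```

Timings from a single run like this are noisy (the JIT compiler needs warming up), so treat them as an order of magnitude at best.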

Can you bring the 0.06 s figure down with an index? Maybe. But consider that:

  • Indexes use memory, sometimes forcing you to store more data on disk.
  • Indexes typically slow down construction and updates.
  • Indexes typically only improve the performance of some queries. This is especially true with multidimensional data.

Verify my results: My Java code is available on paste.bin. I ran the tests on an iMac with an Intel Core i7 (2.8 GHz) processor.

Source: This blog post was motivated by a question from Julian Hyde, of Mondrian fame.

Further reading: Understanding what makes database indexes work


Who will need database administrators in 2020?

In response to my Why do we need database joins? post, many readers stressed the importance of strict database schemas to preserve data integrity. In short, we want database administrators (DBAs) to input constraints at design time so that the integrity of the database is ensured no matter how lousy your programmers are. There is nothing a DBA hates more than having to recover a database from the backup tapes. And several businesses simply cannot afford to have their databases disrupted.

And really, businesses can be careless with their data. I have repeatedly seen businesses and public organizations hire students or recent graduates for 3 months so that they can add a feature or quickly build a web application. Every single time, I cringe. And almost every single time it ends badly. Managers shouldn’t mess with software and data without real professionals. Alas, the bad ones often do. In such a context, we need DBAs to protect the data, to keep the useful applications running.

So, unsurprisingly, we keep hearing that NoSQL databases fail to get any traction in large organizations. Is this a surprise? NoSQL tends to do away with schemas. It gives a lot more power to the programmer. More power to screw up as well. So, we are getting at the heart of the matter. NoSQL is not meant for DBAs. In fact, it is a coup against DBAs:

NoSQL is for programmers. This is a developer led coup. The response to a database problem can’t always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc., programmers would prefer a system that they can make work for themselves.

In effect, NoSQL developers are working to make DBAs less relevant (or even irrelevant). And, if history is our guide, they will succeed. I wouldn’t be surprised if in ten years, we declared database administration to be an obsolete occupation. The software will protect the data better than any DBA ever could. This revolution, however, cannot come from the vendors who sell to DBAs. We must have disruptive innovation. And this is exactly what NoSQL is.

Note: I have nothing against DBAs. I expect to be obsolete myself by 2020. (And hopefully, I’ll find a way to retire early.)

Over-normalization is bad for you

I took a real beating with my previous post where I argued against excessive normalization on the grounds that it increases complexity and inflexibility, and thus makes the application design more difficult. Whenever people get angry enough to post comments on a post of mine, I conclude that I am onto something. So, let’s go at it again.

On the physical side, developers use normalization to avoid storing redundant data. While this might be appropriate with current database systems, I do not think it is well founded in principle. Consider this Java code:

String x1="John Smith";
String x2="Lucy Smith";
String x3="John Smith";

Is this code inefficient? Won’t the Java compiler create 3 strings whereas only 2 are required? Not at all. Java is smart enough to recognize that it needs to store only 2 strings. Thus, there is no reason for this non-normalized table to be inefficient storage-wise even though Jane Wright appears twice:

| Customer ID | First Name | Surname   | Telephone Number |
| 123         | Robert     | Ingram    | 555-861-2025     |
| 456         | Jane       | Wright    | 555-403-1659     |
| 456         | Jane       | Wright    | 555-776-4100     |
| 789         | Maria      | Fernandez | 555-808-9633     |
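
The claim about the Java snippet can be checked directly: identical string literals are interned, so x1 and x3 refer to the same object. A storage engine could apply the same trick to a column, storing each distinct value once and keeping only small integer codes per row. The sketch below shows both; the Dictionary class is a hypothetical illustration, not the internals of any particular database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InternSketch {
    // Dictionary encoding: store each distinct string once and
    // represent a column as a sequence of small integer codes.
    static final class Dictionary {
        private final Map<String, Integer> codes = new HashMap<>();
        private final List<String> values = new ArrayList<>();

        int encode(String s) {
            Integer code = codes.get(s);
            if (code == null) {
                code = values.size();
                codes.put(s, code);
                values.add(s);
            }
            return code;
        }

        String decode(int code) { return values.get(code); }

        int distinct() { return values.size(); }
    }

    public static void main(String[] args) {
        String x1 = "John Smith";
        String x2 = "Lucy Smith";
        String x3 = "John Smith";
        // Identical literals are interned: x1 and x3 are the same object.
        System.out.println(x1 == x3); // true
        System.out.println(x1 == x2); // false

        // The Surname column of the non-normalized table.
        String[] column = {"Ingram", "Wright", "Wright", "Fernandez"};
        Dictionary surnames = new Dictionary();
        int[] encoded = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            encoded[i] = surnames.encode(column[i]);
        }
        System.out.println(surnames.distinct()); // 3
    }
}
```

Four rows end up referring to three stored surnames, which is the storage saving that normalization is usually credited with.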

Nevertheless, the First normal form article on wikipedia suggests normalizing the data into two tables:

| Customer ID | Telephone Number |
| 123         | 555-861-2025     |
| 456         | 555-403-1659     |
| 456         | 555-776-4100     |
| 789         | 555-808-9633     |

| Customer ID | First Name | Surname   |
| 123         | Robert     | Ingram    |
| 456         | Jane       | Wright    |
| 789         | Maria      | Fernandez |
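
Reading the normalized version back as a single table requires a join on Customer ID. Here is a minimal Java sketch of that join, with an in-memory map and arrays standing in for the two tables; the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // Rebuild "id name phone" rows from the two normalized tables,
    // as an inner join on Customer ID would.
    static List<String> join(Map<Integer, String> names, int[] phoneIds, String[] phoneNumbers) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < phoneIds.length; i++) {
            String name = names.get(phoneIds[i]);
            if (name != null) { // inner join: skip phones with no matching customer
                rows.add(phoneIds[i] + " " + name + " " + phoneNumbers[i]);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Customer ID -> name (second table).
        Map<Integer, String> names = new LinkedHashMap<>();
        names.put(123, "Robert Ingram");
        names.put(456, "Jane Wright");
        names.put(789, "Maria Fernandez");

        // Customer ID, telephone number (first table).
        int[] phoneIds = {123, 456, 456, 789};
        String[] phoneNumbers = {"555-861-2025", "555-403-1659", "555-776-4100", "555-808-9633"};

        for (String row : join(names, phoneIds, phoneNumbers)) {
            System.out.println(row);
        }
    }
}
```

With the single non-normalized table, the same listing is just a scan of the rows.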

Pros of the normalized version:

  • It does look like the normalized version uses less storage. However, the database engine could compress the non-normalized version so that both use the same space (in theory).
  • We can enforce the constraint that a customer can have a single name by requiring that the Customer ID is a unique key (in the second table). The same constraint can be enforced on the non-normalized table, but less elegantly.

Pros of the non-normalized version:

  • A customer could have different names depending on the phone number. For example, Edward Buttress could use his real name for his work phone number, but he could report the name Ed Butt for his home phone number. The power to achieve this is entirely in the hands of the software developer: there is no need to change the schema to add such a feature.
  • If you start with existing data or try to merge user accounts, the non-normalized version might be the only sane possibility.
  • The database schema is simpler. We have a single table! You can understand the data just by reading it. Most of your queries will be shorter and more readable (no join!).

To a large extent, it seems to me that the question of whether to normalize or not is similar to the debate of static versus dynamic typing. It is also related to the debate between the proponents of the waterfall method and the agile crowd. Some might say that working with a non-normalized table is cowboy coding. In any case, there is a trade-off: flexibility versus safety. But to my knowledge, the trade-off is largely undocumented. What are the opportunity costs of complex database schemas? How many databases become unusable for lack of normalization?

I think that part of the NoSQL appeal for many developers is strong data independence. Having to redesign your schema to add a feature to your application is painful. It may even kill the feature in question because the cost of trying it out is too high. Normalization makes constraints easier, but it also reduces flexibility. We should, at least, be aware of this trade-off.

Note: Yes, my example goes against the current practice and what is taught in all textbooks. But that is precisely my intent.

Update: Database views can achieve a related level of data independence.