Reading recommendation: Saturn’s Children by Charles Stross

I just finished Saturn’s Children. This is my third Charles Stross novel, after Accelerando and Glasshouse. Saturn’s Children presents itself as a light space opera novel. The heroine is a robot sex slave running for her life in a post-human world. The author does a great job of making us feel for her as she struggles to understand what is happening to her. In this sense, it reminds me of House of Suns by Alastair Reynolds. As you might expect, she thinks about sex quite a bit, and she has sex quite often too… with other robots.

The world depicted is beyond the singularity. In fact, human beings have disappeared. All these robot servants are left to wander on their own without a sane legal or governmental framework. Human beings are gone, but likable robots, created somewhat in man’s image, are struggling for emancipation.

Stross is one of my top five favorite scifi authors. His knowledge of Science, and Computer Science in particular, is genuine. Like all good scifi authors, he could have been a scientist, but we are lucky to have him as a writer instead.

Further reading: Stross runs a popular blog.

Getting a Ph.D. for the money?

Many of my Ph.D. students have admitted to being motivated by financial gain. Stanford is famous for its graduate-students-turned-entrepreneurs. Sergey Brin and Larry Page of Google fame come to mind. But I cannot yet point to a comparable success story. Still, I have seen several variations over the years:

  • Some students believe that having a Ph.D. on their business card will earn them higher salaries.
  • Others believe that a Ph.D. is an ideal setting to come up with commercially-viable technology.

These approaches appear to violate Dijkstra’s second rule:

We all like our work to be socially relevant and scientifically sound. If we can find a topic satisfying both desires, we are lucky; if the two targets are in conflict with each other, let the requirement of scientific soundness prevail.

I have my own related experience. Shortly after getting my Ph.D., in the middle of a post-doctoral fellowship, I decided to run off and start a company with friends. My business plan was simple: I set up a web site where I said that I would do research for money. No ads, no networking, no product, no gimmick. Incredibly, people would call me with job offers. I did well in the sense that I never worried about money. (See this TV interview about some of my crazy industry work, in French.) I also learned much. My experience has helped me become a better researcher. Indeed, I have become very critical of published research. But I was never able to reconcile the pursuit of knowledge with commercial interests. I did a lot of crazy mathematics and programming, but I never had enough time to think deeply about any one issue. And, as it turns out, I don’t love money nearly as much as I love knowledge.

Thus, I tell my students that they should be in it for the pursuit of knowledge. Some get upset. Others appear to ignore me. A few leave.

To sum up my opinion: if your goal is to become a highly-paid consultant or an engineer, a Ph.D. is overkill.

What do you think?

What is more fundamental: Physics or Computer Science?

Computer Science can be taken as a natural science: the study of how the universe processes information. If it is a natural science, then does it build on Physics? Or does Physics build on Computer Science?

The answer is obvious (to me): Without algorithms there would be no Physics! Physics is built on the fundamental assumption that we can model the world using algorithms. Computer Science is the most fundamental natural science.

Does Computer Science make fundamental (and possibly falsifiable) predictions about the universe? Of course, it does! Take the strong Church-Turing thesis: the assumption that the universe is a Turing machine. It has deep implications:

  1. There is no problem solvable by a human brain that cannot be solved by a machine. In particular, creativity and intuition are computable. Philosophically, we have no soul (no more than a PC does).
  2. We all live within a discrete computer simulation. Physics is digital. Continuous functions (such as f(x)=sin(x)) are approximations to the discrete functions governing nature, and not the reverse. We all live in the Matrix.

The predictions are falsifiable:

  1. Come up with a task, such as writing poetry, that a brain can do, but not a digital computer.
  2. Show that we cannot reproduce a physical process using a digital computer.

Yet, these predictions remain unfalsified to this day.

To put it bluntly: The most fundamental of all natural laws is that we live in a Turing machine.

Further reading: The World is Digital (blog post by Dick Lipton)

Sensible hashing of variable-length strings is impossible

Consider the problem of hashing an infinite number of keys—such as the set of all strings of any length—to the set of numbers in {1,2,…,b}. Random hashing proceeds by first picking at random a hash function h in a family of H functions.

I will show that there is always an infinite number of keys that are certain to collide over all H functions, that is, they always have the same hash value for all hash functions.

Pick any hash function h: there is at least one hash value y such that the set A={s : h(s)=y} is infinite, that is, you have infinitely many strings colliding. Pick any other hash function h′ and hash the keys in the set A. There is at least one hash value y′ such that the set A′={s in A : h′(s)=y′} is infinite, that is, there are infinitely many strings colliding on two hash functions. You can repeat this argument any number of times. Repeat it H times. Then you have an infinite set of strings where all strings collide over all of your H hash functions. QED.
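A finite version of the argument can be checked directly. Below is a small Python sketch (the hash family and its parameters are made up for illustration): bucket strings by their full signature over a family of three hash functions, and the pigeonhole principle guarantees a large set of strings that collide on every function at once.

```python
import itertools
import string
from collections import defaultdict

# A toy family of H = 3 hash functions, each mapping strings to b = 4 buckets.
def make_hash(seed, b=4):
    def h(s):
        v = seed
        for c in s:
            v = (v * 31 + ord(c)) % 1000003
        return v % b
    return h

family = [make_hash(seed) for seed in (17, 101, 997)]

# Bucket all strings of length 1 to 3 by their signature (h1(s), h2(s), h3(s)).
buckets = defaultdict(list)
for n in range(1, 4):
    for chars in itertools.product(string.ascii_lowercase, repeat=n):
        s = "".join(chars)
        buckets[tuple(h(s) for h in family)].append(s)

# By the pigeonhole principle, some signature collects a large set of strings:
# all of them collide simultaneously on every function in the family.
largest = max(buckets.values(), key=len)
print(len(largest) > 100)  # True: hundreds of simultaneous collisions
```

With 18,278 strings spread over at most 4³ = 64 signatures, one bucket must hold at least 286 strings, no matter how good the hash functions are.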

Further reading: Do hash tables work in constant time? (blog post) and Recursive n-gram hashing is pairwise independent, at best (research paper).

Trading compression for speed with vectorization

Bitmap indexes are used by search engines (such as Apache Lucene), and they are available in DBMSes such as Oracle and PostgreSQL. They are used in column stores such as the open-source engines Eigenbase and C-Store, as well as in many commercial solutions such as Vertica.

Bitmap indexes are silly data structures. Map each value to an array of booleans. Hence, if you have n rows in your table, and k distinct values, you get an n by k matrix containing booleans. Thus, some people falsely assume that bitmap indexes are only adequate when there are few distinct values (e.g., the gender column, male and female being the only two options). However—using techniques based on run-length encoding—the total size of your bitmaps is proportional to the size of the original table, irrespective of the number of distinct values!
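To see why run-length encoding keeps sparse bitmaps small, here is a minimal sketch (a toy encoder, not the scheme used by any particular engine):

```python
# Run-length encode one bitmap column: the encoded size depends on the
# number of runs, not on the number of distinct values in the table.
def rle(bits):
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([b, 1])  # start a new run
    return runs

# A sparse column bitmap over 1000 rows with a single set bit compresses
# to just three runs. Summed over all k columns, the runs account for the
# n set bits of the table, hence total size proportional to the table.
bits = [0] * 1000
bits[500] = 1
print(rle(bits))  # [[0, 500], [1, 1], [0, 499]]
```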

Bitmap indexes are fast because they benefit from vectorization. Indeed, suppose the predicate “sex=male” is satisfied on rows 1, 5, 32, 45, 54 and 63. I can determine which rows satisfy the extended predicate “(sex=male) AND (city=Montreal)” using a single instruction! The secret? A bitwise AND between the bitmaps “sex=male” and “city=Montreal”. You can compute unions, differences and intersections between sets of integers in [1,N] using only N/64 operations. All microprocessors have built-in parallelism because they operate on several bits at once.
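The trick can be sketched in a few lines of Python, using arbitrary-width integers to stand in for machine words (a real engine would operate on 64-bit words):

```python
# Pack row memberships into an integer bit vector, one bit per row.
def to_bitmap(rows):
    bm = 0
    for r in rows:
        bm |= 1 << r
    return bm

def from_bitmap(bm):
    rows, i = [], 0
    while bm:
        if bm & 1:
            rows.append(i)
        bm >>= 1
        i += 1
    return rows

male = to_bitmap([1, 5, 32, 45, 54, 63])
montreal = to_bitmap([5, 45, 63, 80])
both = male & montreal  # one bitwise AND intersects the two row sets
print(from_bitmap(both))  # [5, 45, 63]
```

Replace `&` with `|` for a union or with `& ~` for a difference: the hardware processes a whole word of rows per operation.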

To benefit from vectorization, you need to store the data in a word-aligned manner: that is, you store consecutive segments of bits uncompressed. The longer the words, the less compression. Roughly speaking, 64-bit bitmap indexes are nearly twice as large as 32-bit bitmap indexes. What is the effect on the processing speed? We found that despite being much larger, 64-bit bitmap indexes were faster. That is right: it was faster to load twice as much data from disk!

Yet, we often equate concise data structures with more speed. This assumption can be misguided. Given a choice between more compression, or more vectorization, I would choose more vectorization.


Further reading: See my posts Compressed bitmaps in Java, To improve your indexes: sort your tables!, and The mythical bitmap index.

Scifi book recommendations from my summer list: Reynolds, Hamilton and Weber’s Safehold

Currently, I am finishing off House of Suns by Alastair Reynolds. I am fascinated by Reynolds’ universe. Let me quote the beginning of the book:

I was born in a house with a million rooms, built on a small, airless world on the edge of an empire of light and commerce that the adults called the Golden Hour, for a reason that I did not yet grasp. I was a girl then, a single individual called Abigail Gentian.

Abigail is not speaking figuratively. Why she lives in a house with a million rooms, why her empire was called the Golden Hour, why she points out that she was once a single individual, and a girl: each of these questions has an intriguing answer.

Here are some science-fiction books that I liked this summer.

  • I began the summer with Pandora’s Star (Commonwealth Saga, Book 1) by Peter F. Hamilton. What I found fascinating is the universe painted by Hamilton. A future where human beings live forever, travel between stars instantaneously, transform their bodies at will, extend their minds, and so on. If it sounds like utopia, it is! Except that a small group of terrorists are creating trouble, claiming that unseen aliens are manipulating us. Clearly these terrorists are madmen. Or are they? After the conclusion of the three books of the Commonwealth Saga, the story continues with the Void Trilogy, 1,200 years later. I will be waiting for the conclusion in The Evolutionary Void with high expectations. The quality of the writing is exceptional, the characters are compelling. Even the aliens are original, and that’s quite a feat.
  • I am not a big fan of David Weber. He is a good writer, but his stories are shallow and full of holes. Yet, his Safehold series is engaging. I had great fun with Off Armageddon Reef (Book 1 of the series). The fanciful and unbelievable science-fiction plot is an excuse to relive the construction of the British Empire. If you like to think about how technological innovations come about and you like military history, this is a great series. Do not expect spaceships or laser cannons.

Do hash tables work in constant time?

Theory in Computer Science—as in any other field—is based on models. These models make many hidden assumptions. This is one of the fundamental reasons why pure theory is wasteful. We must constantly revisit our old assumptions and run experiments to determine whether our models are useful.

Hash tables are a fundamental data structure. They are part of every elementary introduction to Computer Science. From a programmer’s point of view, they associate keys with values. Hash tables go back to the beginnings of Computer Science, even though new variations keep being invented (e.g., Cuckoo hashing). Incredibly, hash tables were only recently integrated into the C++ language, but less conservative languages like Java, Python, Ruby, Perl, and PHP all have hash tables built in.

The secret sauce is that the key is first transformed (or hashed) to a number between 1 and m corresponding to a cell in an array. As long as this number is generated with sufficient randomness, hash tables are almost equivalent to looking up values in an array—except for the overhead of hashing the key.

Thus, everyone knows that hash table queries run in amortized constant time. That is, as the number of keys increases, the average time necessary to recover a key-value pair does not increase.

But wait! How do we compute hash values? Consider the standard universal hashing technique described in Cormen et al., Introduction to Algorithms. Pick any prime number m, and pick a number a at random in [0,m). Then h(x) = a x modulo m is a universal hash function. Multiplying numbers in [0,m) requires time that grows with log m. For the purposes of hash tables, the number m must be close to the number of distinct keys n. Thus—formally—hash tables often run in time O(log n).
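Here is a toy version of that scheme (the prime and the table size below are arbitrary choices for illustration):

```python
import random

# Textbook universal hashing: pick a prime m and a random multiplier a.
m = 1000003  # a prime close to the expected number of keys n
a = random.randrange(1, m)

def h(x):
    # One multiplication modulo m. This multiplication is not a
    # constant-time operation in general: its cost grows with the number
    # of bits of m, hence with log n when m tracks n.
    return (a * x) % m

bucket = h(123456789) % 64  # map the hash value to one of 64 array cells
```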

Am I being pedantic? Does the time required to multiply integers on a modern machine depend on the size of the integers? It certainly does if you are using vectorization. And vectorization is used in commercial databases!

And what if your key is a string? As you have more keys, you are more likely to encounter longer strings which take longer to hash.

Thus, it is possible for hash tables to run in time O(log n) in practice, despite what the textbooks say.

What other hidden assumptions are you making right now?

Note 1: Fortunately, universal hashing can be computed in linear time with respect to the number of bits, by avoiding multiplications.

Note 2: The problems that result from picking the wrong model of computation are well known and addressed by most textbooks. I have not discovered anything new.

Note 3: In a recent paper on fast string hashing, we show that, in practice, you can hash strings for a fraction of a CPU cycle per byte.

Update: See my more recent post Sensible hashing of variable-length strings is impossible.

Picking a web page at random on the Web

To do statistics over the Web, we need samples. Thus, I want to know how to pick a Web page at random, without making much effort. If you are Google or Microsoft, it is easy. But what about the rest of us? And what if I want to pick users at random on Facebook?

In effect, I want to sample a virtually infinite data set: I consider the Web to be infinite because I cannot enumerate all elements in it. Given finite resources against an infinite data set, can I sample fairly?

Thankfully, the objects in this set are linked and indexed. Hence, I came up with two sampling strategies:

  • Using the hyperlinks: Start with any Web page. Recursively follow hyperlinks until you have many Web pages. Pick a page at random in this set. (Mathematically, I am sampling nodes in an infinite graph.)
  • Using the index: Pick a set of common words and enter them into a search engine. Pick a page at random in the result set.
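The hyperlink strategy can be sketched on a toy link graph (the graph below is entirely made up):

```python
import random

# A hypothetical Web: each page maps to the pages it links to.
links = {
    "a": ["b", "c"], "b": ["c", "d"], "c": ["a", "e"],
    "d": ["e"], "e": ["a", "b"],
}

def crawl_sample(start, steps=1000):
    # Follow hyperlinks at random, collecting pages along the way,
    # then pick uniformly from the collected set.
    visited, page = set(), start
    for _ in range(steps):
        visited.add(page)
        page = random.choice(links[page])
    return random.choice(sorted(visited))

page = crawl_sample("a")
# Note the bias: well-connected pages are visited (and thus collected)
# early, while pages reachable only through rare paths may be missed.
```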

Are there any other strategies? What can I say (if anything) about the biases of my sampling?

Further reading: D. Benoit, D. Slauenwhite, N. Schofield, and A. Trudel, World’s First Class C Web Census: The First Step in a Complete Census of the Web, Journal of Networks 2 (2), 2007.

Update: An anonymous reader points me to Ziv Bar-Yossef and Maxim Gurevich, Random sampling from a search engine’s index, JACM 55 (5), 2008.

Update 2: Chris Brew points me to Henzinger et al., On Near-Uniform URL Sampling, WWW 2000.

Update 3: Regarding Facebook, I found Gjoka et al., Unbiased Sampling of Facebook, 2009.

A review of “Hello World: Computer Programming for Kids and Other Beginners”

I learned programming on my own when I was twelve years old, with a TRS-80 and Microsoft Basic. The documentation that came with the TRS-80 was fantastic. Alas, today, no vendor would ever think of including an introduction to programming with a computer. If you are a dad (or a mom) and you regret this, then Hello World: Computer Programming for Kids and Other Beginners by Warren and Carter Sande is for you. The book is cheap: while the list price is $34, Amazon sells it for a little over $23.

As a kid, my initial goal was to create my own video games. And I did! I learned (entirely on my own) that:

  • Even though computers are fast, it is hard to create fast software. Being clever is hard work!
  • While it is easy to program a small game, programming a slightly more complex game can be orders of magnitude more difficult.

This book should allow your kids to learn as much, and more.

Instead of using Microsoft Basic, the authors use Python. An excellent choice. They cover all the same material as in my venerable TRS-80 documentation: random numbers, variables, loops, graphics, functions and sound. However, there are a few more advanced concepts: dynamic arrays, arrays of arrays, objects, modules, file input and output, and event-based programming.

Note that many of the examples of the book would not run on the latest Python release (3.1). That is a minor concern since I would recommend you stay away from Python 3 in any case.

Disclosure: I was given a free e-book in exchange for my review.

What FriendFeed got wrong

Don’t you feel sometimes like your brain is running out of storage space? Myself, I am very forgetful. I always seek new tools to extend my brain.

FriendFeed is a fantastic social networking site. It lets you integrate all of your activities from all over the Web into a single flow. You can browse mine. Unfortunately, it has failed to attract attention. Worse: many of its features have been copied by Facebook.

Instead of trying to compete with Facebook and Twitter, FriendFeed should offer lifestreaming: an indexed and exhaustive record of everything you have done, for all times. In effect, FriendFeed should be an external-memory extension to your brain.

Note: This blog post was inspired by a chat with my colleague and social-networking researcher Sébastien Paquet.

After Netflix? What next?

The Netflix competition is nearly concluded. We have learned that ensemble methods are the solution for more accuracy.

The recommender system community moves on. Immediate questions come to mind:

  • Researchers continue to use the Netflix data set. Will it remain freely available?
  • We need to study the effect of a 10% accuracy gain (measured by RMSE) on user satisfaction. How do we go about it?

Otherwise, it seems that future research is bound by the available data. Models and theory alone have never had much of an impact on the field. Accordingly, there has been a surge of research on recommender systems using social network sites as a data source. Alas, social data is sparse, heterogeneous and ephemeral. Is all the low-hanging fruit gone?

Design a recommender system: they read your resume

Goodreads is a Social Web site about books. They need a recommender system. Thus, they issued a challenge: design their recommendation engine, and they will read your resume. I suppose this is the poor man’s version of the Netflix challenge.

There is nothing wrong with the challenge, but I wish the current Goodreads engineers would publish their own recommender system. How hard can it be to do item-based analysis?
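For what it is worth, a bare-bones item-based analysis fits in a dozen lines. The shelves below are made up, and a real engine would normalize by item popularity:

```python
from collections import Counter

# Hypothetical user shelves: which books each reader has rated or saved.
shelves = {
    "alice": {"dune", "hyperion", "foundation"},
    "bob": {"dune", "foundation"},
    "carol": {"hyperion", "ilium"},
}

def similar_items(target):
    # Score books by how often they co-occur with the target on a shelf,
    # then return them ranked by co-occurrence count.
    counts = Counter()
    for books in shelves.values():
        if target in books:
            counts.update(b for b in books if b != target)
    return [b for b, _ in counts.most_common()]

print(similar_items("dune"))  # ['foundation', 'hyperion']
```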

Update: Students at Stanford built and tested a recommender system for Goodreads.

Column stores and row stores: should you care?

Most database users know row-oriented databases such as Oracle or MySQL. In such engines, the data is organized by rows. Database researcher and guru Michael Stonebraker has been advocating column-oriented databases. The idea is quite simple: by organizing the data into columns, we can compress it more efficiently (using simple ideas like run-length encoding). He even founded a company, Vertica, to sell this idea.

Daniel Tunkelang is back from SIGMOD: he reports that column-oriented databases have grabbed much mindshare. While I did not attend SIGMOD, I am not surprised. Daniel Abadi was awarded the 2008 SIGMOD Jim Gray Doctoral Dissertation Award for his excellent thesis on Column-Oriented Database Systems. Such great work supported by influential people such as Stonebraker is likely to get people talking.

But are column-oriented databases the next big thing? No.

  • Column stores have been around for a long time in the form of bitmap and projection indexes. Conceptually, there is little difference. (See my own work on bitmap indexes.)
  • While it is trivial to change or delete a row in a row-oriented database, it is harder in column-oriented databases. Hence, applications are limited to data warehousing.
  • Column-oriented databases are faster for some applications. Sometimes faster by two orders of magnitude, especially on low-selectivity queries. Yet, part of these gains is due to the recent evolution of our hardware. Hardware configurations where reading data sequentially is very cheap favor sequential organizations of the data such as column stores. What might happen in the world of storage and microprocessors in the next ten years?

I believe Nicolas Bruno said it best in Teaching an Old Elephant New Tricks:

(…) some C-store proponents argue that C-stores are fundamentally different from traditional engines, and therefore their benefits cannot be incorporated into a relational engine short of a complete rewrite (…) we (…) show that many of the benefits of C-stores can indeed be simulated in traditional engines with no changes whatsoever.  Finally, we predict that traditional relational engines will eventually leverage most of the benefits of C-stores natively, as is currently happening in other domains such as XML data.

That is not to say that you should avoid Vertica’s products or do research on column-oriented databases. However, do not bet your career on them. The hype will not last.

(For a contrarian point of view, read Abadi and Madden’s blog post on why column stores are fundamentally superior.)

Netflix competition is over?

The Netflix competition is a $1 million research competition to improve the Netflix movie recommender system by 10%. A large team called BellKor’s Pragmatic Chaos just announced that they won (update: unless someone can beat them in the next month). Among them is Yehuda Koren, with whom I organized the 2nd Netflix-KDD Workshop, and also some engineers from my home town (Piotte and Chabbert). I do not know how they will split the money, but I suspect each one of them will get at least $100k.

I want a study on the benefits of this new technology on the Netflix users.

Reference: See my older posts Proceedings of the Large-Scale Recommender Systems workshop and Netflix: an interesting Machine Learning game, but is it good science?

Why senior researchers and managers should analyze data themselves…

Scientists, businessmen and even spies are supposed to analyze data collaboratively. Are they?

If you are a scientist, you are familiar with the following type of research collaboration: a lowly student collects the data, crunches the numbers and plots the data. Other collaborators—such as the professor—merely comment on the tables and plots. Similarly, the CEO sees the pie chart, while the assistant crunches the numbers. That is vertical collaboration: you clean the basement and I will clean the main floor.

Yet, reliable data analysis requires horizontal collaboration.  Indeed, there are downsides to task specialization:

  • By never looking at the data, senior scientists and managers rely on experience and hearsay. Their incoming bandwidth is dramatically reduced. Nature is the best coauthor. Consider how the best American spies were fooled prior to 9/11 while all the data to catch the terrorists was available. Bandwidth is a requirement to be smart.
  • When a single person crunches the numbers, hard-to-detect errors creep in. The problem is serious: Ioannidis showed that most research findings are wrong.
  • With nobody to review the source data, the sole data analyst is more likely to cheat. Why rerun these tests properly, when you can just randomly dismiss part of the data? People are lazy: when given no incentive, we take the easy way out.

The common justification for task specialization is that senior researchers and managers do not have the time. Yet, 30 years ago, researchers and managers did not type their own letters. Improve the tools, and reduce task specialization.

With Sylvie Noël, I decided to have a closer look. My preliminary conclusions are as follows:

  • There are adequate tools to support rich collaboration over data analysis. Collaboratories have been around for a long time. We have the technology! Yet, we may need a disruption: inexpensive, accessible and convenient tools. The current migration toward Web-based applications might help.
  • Given a chance, everyone will pitch in. To demonstrate this, we collected user data from sites such as IBM Many Eyes and StatCrunch. We then ran an Ochoa-Duval analysis. We found that the network of users within web-based data analysis tools is comparable to other Web 2.0 sites.

As a database researcher, I think that further progress lies with loosely coupled data (no big tables! no centralized tool!) and flexible visualization tools (stop the pie charts! go with tag clouds!). I am currently looking for new research directions on this problem, any idea?


Stop generating metadata and access the full content!

Many researchers advocate the use of metadata to help find or recommend content automatically. Metadata is certainly useful when aggregating content for human beings: I first read the titles of research papers before reading them. However, machines do better when they access at least some of the content (Lin, 2009). Moreover, metadata is of little value in ranking answers (Hawking and Zobel, 2007).

I think that researchers cling to metadata because that is how we have indexed books for so long. When I was a kid, a full-text search of a library was unthinkable. Yet, there is no escape: everything is miscellaneous. Folksonomies and ontologies will not save the day. When working with machines, let go of metadata and embrace the full content.

I am particularly puzzled by a common research approach. Take an object. Extract metadata. Then compare objects between themselves using the metadata, or use the metadata for retrieval. I understand that this may constitute a useful form of dimensionality reduction. Yet, researchers frequently fail to check whether it is necessary to extract metadata at all.


  • David Hawking and Justin Zobel, Does topic metadata help with Web search? Journal of the American Society for Information Science 58 (5), 2007.
  • Jimmy Lin, Is searching full text more effective than searching abstracts? BMC Bioinformatics 2009, 10:46, 2009.

Credit: Thanks to Andre Vellino for motivating this post.

Quantum databases: what are they good for?

Hu et al. just posted An efficient quantum search engine on unsorted database. They refer to an older paper by Patel (2001), Quantum database search can do without sorting. Apparently without any data structure or preprocessing, logarithmic-time search queries are possible in quantum databases.

Even if we did have affordable quantum computers, this would not be a big selling point. Building B-trees or sorting tables is hardly prohibitive.

I would be more interested in how quantum databases can handle low selectivity queries. For example, can a quantum computer sum up a large array of numbers in near-constant time? Our current technology solves these problems with a mix of parallelism, compression and brute-force.

All I know about quantum computers is that we do not think they can solve NP-hard problems in polynomial time. Is there any reason to see a bright future in quantum databases?

Why are dynamic languages easier than static languages?

Dynamic languages like Python or Ruby are considerably easier than static languages like Java and C++. John is asking us why:

How do you account for the huge increases in productivity that dynamic language advocates say they experience?

My answer is duck typing. It is powerful because formal and tight definitions are hard and less useful than we might expect.

The difference is fundamental:

Static view         Dynamic view
Tight definitions   Duck typing
Ontologies          Folksonomies
Specific solutions  Generic solutions
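Duck typing itself fits in a few lines of Python (the classes below are, of course, contrived):

```python
# No interface declaration is needed: any object with the right method works.
class Duck:
    def speak(self):
        return "quack"

class Robot:
    def speak(self):
        return "beep"

def greet(thing):
    # Accepts anything that can speak(), with no common base class required.
    return thing.speak()

print([greet(x) for x in (Duck(), Robot())])  # ['quack', 'beep']
```

In Java or C++, `greet` would require an interface or a template; here, the tight definition is simply never written down.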


It’s official: the standard programming language for multidimensional databases is MDX

Julian Hyde just announced that Oracle will support MDX: they were the last vendor to resist this Microsoft technology. MDX is to multidimensional databases what SQL is to relational databases.

I am disappointed that MDX is the best we could come up with. I don’t find MDX enjoyable.

Further reading: What are the computer language people waiting for?

Compressed bitmaps in Java

A bitmap is an efficient array of boolean values. They are commonly used in bitmap indexes. The Java language has a bitmap class: BitSet.

Unfortunately, the Java BitSet class will not scale to large sparse bitmaps—the Sun implementation does not use compression.

I published a new free alternative: JavaEWAH.
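The idea behind word-aligned compression can be sketched in a few lines of Python (a toy illustration of the principle, not the JavaEWAH format itself):

```python
WORD = 64  # width of a machine word, in bits

def compress(words):
    # Runs of all-zero or all-one words are stored as a single counter;
    # mixed ("dirty") words are stored verbatim, word-aligned.
    out = []
    for w in words:
        if out and out[-1][0] == "run" and out[-1][1] == w:
            out[-1][2] += 1  # extend the current run of identical words
        elif w in (0, (1 << WORD) - 1):
            out.append(["run", w, 1])
        else:
            out.append(["literal", w, 1])
    return out

# A sparse bitmap of a million bits with a single set bit stays tiny.
words = [0] * (1_000_000 // WORD)
words[100] = 1 << 7
print(len(compress(words)))  # 3 entries instead of 15625 words
```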



Credit: Glen Newton gave me the idea for this project.