The most important Theoretical Computer Science problem is inconsequential

Some consider the P = NP problem to be the most important Theoretical Computer Science problem. It asks whether all problems whose solution can be verified quickly, can also be solved quickly. If you can answer this question, you win one million dollars.

The catch is that quickly is defined as in polynomial time. Thus, if you can solve a problem in time O(nx) where x is the number of electrons in the universe, then you are quick.

This is an annoyingly silly definition. Bubble sort is in O(n2). Yet, try replacing all sorting within Google’s infrastructure by this quick algorithm. Google would die.

Lance Fortnow asks us to take for granted that proving P = NP implies we have fast algorithms for all NP problems. William Gasarch just predicted that proving P ≠ NP would also help real world of algorithms.

Unknowingly to them, Zeilberger proved that P=NP on April 1, 2009. Yet, nothing happened. (He was kidding. Or was he?) Anyhow, enough with the dogma! While intermediate steps in the solution of the problem might be critically important to our understanding of computation, knowing that P = NP is inconsequential technologically.

This is as silly as claiming that sending men to Mars will cure cancer. It might very well that the research necessary to send men to Mars might lead to major breakthroughs, but whether we go to Mars or not has nothing to do with cancer.

Yes, please send men on Mars. Yes, please work on the P = NP problem. But stop claiming the answer would change the world.

Further reading: See my older post Is P vs. NP a practical problem?. Dick Lipton wrote a related blog post as well.

Relational databases: are they obsolete?

Michael Stonebraker is predicting that the dominance of the generic relational database is coming to an end. Having recently founded several database companies, he has a vested interested in this prediction .

Here is Stonebraker logic: we can outperform relational databases with specialized solutions. Therefore, users will migrate to specialized engines. In effect, specialized players such as Vertica will grab market shares from Oracle Database and Microsoft SQL Server.

Unfortunately, Stonebraker’s arguments are misleading. As far as performance is concerned, Stonebraker is obviously right: we are undergoing major changes. As pointed out by Daniel Tunkelang, you can store a lot of data in 32GB of RAM. Solid-state drives can be used to wipe out some IO bottlenecks. Yet, these technological changes will not change the game for two reasons:

  • We have always been able to outperform generic relational databases: (1) column stores have been around since the seventies when they were called transposed files (2) search engines have always used their own indexes (3) lightweight key-value engines like Tokyo Cabinet have always been around. Generic relational databases did not achieve dominance due to their superior performance.
  • Generic relational databases are frequently catching up to specialized engines. In particular, they are not limited to row stores. Curt Monash’s blog post on Oracle’s hybrid columnar approach makes this obvious. Nicolas Bruno, in Teaching an Old Elephant New Tricks, predicted that the lessons learned by start-ups such as Vertica will be integrated into traditional relational engines.

Further reading: I was motivated by the latest StorageMojo blog post. See also my blog posts Trading compression for speed with vectorization, Changing your perspective: horizontal, vertical and hybrid data models, Column stores and row stores: should you care? and Native XML databases: have they taken the world over yet?

The hard truth about research grants

You must do many silly things to get a large research grant:

  • You must know precisely what you will do for the next five years. Yet, in my experience, good researchers only have a vague idea of where they will be in 5 years. If you know the promising research directions you will encounter, you are an astrologer, not a scientist.
  • Your plan must be infallible. Yet, in my experience, research ideas that never fail are not research ideas at all! They are mere applications of known principles. There is no research without risk. The more ambitious the research, the riskier it is.
  • Your work must revolutionize your field. Yet, as stated above, it must also be  infallible.
  • Your work must have crucial applications. Make it up if you must! When Einstein came up with E=mc2, he should have pointed out the possibility of dropping nasty bombs on Japan. Again, we expect scientists to know the future.
  • You must work in large teams, packing up a broad range of researchers. Yet, there is no clear evidence that correlation exists between the resort to extramural collaboration and overall research performance. The goal—once more—is to minimize risks. It is unclear whether it actually reduces the risk at all. And the cost is added bureaucracy.
  • You must promise to train many new Ph.D.s. Somehow, flooding the job market so that there are hundreds of qualified applicants for every research job is highly rewarded.

Thankfully, Peter A. Lawrence has many good ideas on how to improve the system:

  • Grant applications should be short. Quick to prepare. Quick to process. Currently, it is not uncommon for scientists to spend months writing grant applications, and weeks reviewing them. Wasteful!
  • You should be able to apply for grants based on your record alone. The requirement to submit a project or research program is silly.
  • We should be critical of large groups. For some large projects, such as building a multi-billion-dollar machine, you need a group. To investigate the properties of some algebraic structure, you don’t.
  • You should only offer a handful of research papers for review. Pick your top 3 papers and don’t mention the rest. In Canada, NSERC applied this idea, but they did not go far enough. They ask us for our top contributions, and then they ask for the full list. By reviewing researchers on only their best work, we would lift the incentive to produce many weak papers.

I conclude from a quote by Michael Nielsen:

Some years back I constructed a list of papers I especially admired, and was surprised to discover that with only a few exceptions they were produced from unfunded research. This was sobering, since it suggest that receiving research grants was (…) anticorrelated with doing work of the highest quality. Grants seem to be good at sustaining an established area, but not very good at all at producing the conceptual innovations that start new subfields. (Michael Nielsen)

(And before anyone jumps at this… Michael did not write that having a large grant was incompatible with highly original research.)

How things change: Cheaters are Innovators

If you seek approval above all else, you are unlikely to innovate outside the rigid bounds of the current system:

  • You do not convince existing journals to give more respect to this new field you created. You go out and create your own journals and conferences. John von Neumann did not wait for his colleagues to approve of his work on Computers. In fact, he had to use threats to get what he wanted.
  • You do not convince libraries to embrace e-commerce. You create
  • You do not convince university librarians to stop worrying about what the publishers need. You go out and create Google Scholar.
  • You do not wait for’s management to approve the use of a recommender system. You do what Greg Linden did and squeeze the feature in during a test.
  • If you can prove Poincaré conjecture, you don’t wait for a journal to approve your work, you just post it on arxiv.

Changing a system from the inside is inefficient and failure-prone. Stop wasting your time and make the old system irrelevant. Do not seek approval. Go out and test your ideas yourself. Listen respectfully to others, but make up your own mind. Work within a large company, scientific community or University if you must, but never fall under the illusion that you can change it by playing fair. Cheating is the fundamental mechanism by which change happens. Cheaters are innovators.

How are you planning to cheat today?

Further reading: Is scientific publishing about to be disrupted by Michael Nielsen

Where do the best mathematicians come from?

Americans think that the best scientists come from their best universities. To learn more, consider where the influential mathematicians from:

got their degree in the USA 33%
got their Ph.D. in the USA 58%
working in the USA 68%

The American scientific dominance relies partially on the ability of the USA to attract and retain the best and the brightest:

  • 33.6% of European PhDs were attracted to faculty or research positions in the US.
  • 55% of non-European foreign PhDs were attracted to faculty positions in the US.
  • Only 10% of American Ph.D.s go work abroad.

Reference: Influential Mathematicians: Birth, Education and Affiliation

Further reading: It may not matter all that much where you go to college and Big schools are no longer giving researchers an edge?

Changing your perspective: horizontal, vertical and hybrid data models

Data has natural layouts:

  • text is written from the first to the last word,
  • database tables are written one row at a time,
  • Google presents results one document at a time,
  • the early recommender systems compared users to other users,
  • discussions are organized in newsgroups and posting boards by topic,
  • research papers are organized in journals and conferences,
  • objects have attributes (a ball is red), and from these attributes we determine similarities between objects.

Using a database terminology, these are horizontal layouts.

We can rotate these models to create vertical layouts:

  • Instead of writing text sequentially, we can store the locations of each word in an inverted index.
  • Instead of writing database tables row-by-row (e.g., Oracle, MySQL), we can write databases column by column (e.g., C-Store/Vertica, LucidDB, Sybase IQ, and my Lemur Bitmap Index Library).
  • Instead of presenting results sets one document at a time, we present tag clouds and use faceted search to support exploration. Thus, instead of listing documents, we focus on attributes (date, topic, author).
  • Recommender systems are often more scalable when they compare items instead of users: the most famous example is Greg Linden‘s Amazon recommender (if you liked this book, you may like…). For example, the Slope One algorithms outperform many user-to-user algorithms.
  • The social web started out with topic-oriented newsgroups and posting boards, but it is not dominated by user-centric blogs and social sites (such as Facebook or Twitter). Since then, we have realized that user-oriented blogs can be preferable.
  • While research papers are published in conferences and journals, I argue that we should turn this around and organize research papers by author through author-specific feeds.
  • Some AI researchers are suggesting that relations might be primary whereas attributes would be secondary.
Many of the best solutions are hybrids. For example, text search sometimes require full-text indexes such as suffix arrays and Oracle recently announced a row/column hybrid.
Take away message: If you are stuck, try to rotate your data model. If neither the vertical nor the horizontal model is a good fit, create an hybrid.

Toward author-centric science

Too many research papers in Computer Science are nonsense: they convey no worthy message. Yet they pass a Turing test of sort: at a glance, they are indistinguishable from interesting research papers. In fact, they are designed as nonsense from the beginning: the authors mimic the output of good research. The goal is to appear to be doing valued research. Some of these papers appear in top conferences, and even go on to be highly cited.

What is the difference between a Stephen King novel and the average horror manuscript? At first glance, probably not much. After all, it is all a matter of taste. Yet if you have read enough novels, you can recognize King’s mastery. Avid readers know who are the best writers and they stick with them. It is not sufficient to get a manuscript accepted by King’s publisher to get the attention of his readers. People first care for the author. Stephen King is the brand.

In science, we publish by venue, by journal and by topic. Yet if we followed specific researchers, the same way we follow specific writers, these researchers would have a strong incentive to produce interesting work. It would not be profitable to write many uninteresting research papers since nobody would follow your work.

The idea is not new nor original. Back in 2005, I urged researchers to setup RSS feeds for their publications. Arxiv recently made author feeds available. Creating RSS feeds is technically easy. We have no excuse.

Please: if you want me to read you, make it easy to receive notice whenever you publish a new research paper. If enough researchers do it, we might improve the quality science significantly. Why wouldn’t you want people to follow your work?

Further reading: Scholarly Communications must be Syndicated by Gideon Burton

Eating my own dog food: You can subscribe to my research papers by email or through a reader.

Why I am not publishing in PLoS One, yet

PLoS One is a new peer-reviewed journal (2006) with many interesting features:

Unfortunately, for a Computer Scientist, it is not yet attractive:

  • The Computer Science section is filled with biology and medicine papers making use of Information Technology. In other words, the PLoS One taxonomy  confuses Information Technology and Computer Science! Thankfully, I could find one article in Natural Language Processing which might be the first and only Computer Science paper published in PLoS One. So there is hope.
  • As a related point, PLoS One is not indexed at the usual places as a Computer Science journal (DBLP, ACM DL, and so on). Of course, no Computer Science indexing is possible until PLoS One correctly classifies the Computer Science articles.

If they could fix these problems, I would gladly submit some of my work to them. PLoS One could become a useful journal in Computer Science over time. What about prestige? PLoS One uses article-level metrics. Instead of trying to be a prestigious journal, PLoS One helps you measure the impact of your own papers.

    Update: PLoS Computational Biology is now indexed by DBLP.

    Attributes of good research

    Paul Graham gives a list of attributes characterizing start-ups. It strikes me that many of these attributes could describe research projects as well:

    1. Good research projects fail. If there is no risk of failure, you are doing unoriginal research. (Except that out of my biggest failures have come out some of my best papers…)
    2. Good research directions change frequently. Otherwise, how can you be following the truth where it must lead you? (Except that if you keep changing direction, you’ll never get anything done.)
    3. It takes little money. While some research projects are expensive, Einstein changed Physics forever without a research grant. (Except that if you are worrying about your next pay check, you can hardly worry about research.)
    4. Good research is threatening. If your research never upsets anyone, maybe you are not pushing hard enough? (Except that you should not be bold just for boldness sake.)
    5. Research is a solitary task. Ultimately, all research projects involve many hours working alone. (Except that research is fundamentally social!)

    Trading compression for speed with vectorization

    Bitmap indexes are used by search engines (such as Apache Lucene), they are available in DBMSes such as Oracle and PostgreSQL. They are used in column stores such as the Open Source engines Eigenbase and C-Store, as well as by many commercial solutions such as Vertica.

    Bitmap indexes are silly data structures. Map each value to an array of booleans. Hence, if you have n rows in your table, and k distinct values, you get an n by k matrix containing booleans. Thus, some people falsely assume that bitmap indexes are only adequate when there are few distinct values (e.g., the gender column, male and female being the only two options). However—using techniques based on run-length encoding—the total size of your bitmaps is proportional to the size of the original table, irrespective of the number of distinct values!

    Bitmap indexes are fast because they benefit from vectorization. Indeed, let the predicate “sex=male” is satisfied on rows 1, 5, 32, 45, 54 and 63. I can determine which rows satisfy the extended predicate “(sex=male) AND (city=Montreal)” using a single instruction! The secret? A bitwise AND between the bitmaps “sex=male” and “city=Montreal”. You can compute unions, differences and intersections between sets of integers in [1,N] using only N/64 operations. All microprocessors have built-in parallelism because they operate on several bits at once.

    To benefit from vectorization, you need to store the data in a word-aligned manner: that is, you store consecutive segments of bits uncompressed. The longer the words, the less compression. Roughly speaking, 64-bit bitmap indexes are nearly twice as large as 32-bit bitmap indexes. What is the effect on the processing speed? We found that despite being much larger, 64-bit bitmap indexes were faster. That is right: it was faster to load twice as much data from disk!

    Yet, we often equate concise data structures with more speed. This assumption can be misguided. Given a choice between more compression, or more vectorization, I would choose more vectorization.


    Further reading: See my posts Compressed bitmaps in Java, To improve your indexes: sort your tables!, and The mythical bitmap index.

    To be smarter, ignore external rewards

    Last night, I watched a great talk by Dan Pink—author of several self-help books. He made a compelling point and he cited research papers. I went and read these research papers and I had great fun. Essentially, boosting your motivation with external rewards can lower the quality—though maybe not the quantity—of your work.

    Indeed, Ariely et al. tell us in Large stakes and big mistakes that rewards may fail to increase worker productivity on cognitive demanding tasks. Eriksson et al. describes the phenomenon as follows:

    (…) increased conscious attention to one’s own process of performance, implicit competition, cash incentives, or the presence of an audience can reduce performance (…)

    These results have been independently verified several times (see Mobbs et al. for example).

    My conclusion: Researchers or engineers pushed hard by external rewards will produce more, but not better work. To get higher quality and creative work, you need low pressure environment. This may explain why many creative people cut themselves off from the world for extended periods.

    Rewarding researchers with better funding, better pay and prizes, may increase the volume of their contribution, while lowering the quality of their work.

    Here is the talk that motivated this blog post:

    Further reading: We Perform Best When No One Tells Us What To Do

    Scifi book recommendations from my summer list: Reynolds, Hamilton and Weber’s Safehold

    Currently, I am finishing off  House of Suns by Alastair Reynolds. I am fascinated by Reynolds’ universe. Let me quote the beginning of the book:

    I was born in a house with a million rooms, built on a small, airless world on the edge of an empire of light and commerce that the adults called the Golden Hour, for a reason that I did yet grasp. I was a girl then, a single individual called Abigail Gentian.

    Abigail is not making any figure of speech. Why she lives in a house with a million rooms, why her empire was called the Golden Hour, why she points out that she once was a single individual, and a girl, each one of these questions has an intriguing answer.

    Here are some science-fictions books that I liked this summer.

    • I began the summer with Pandora’s Star (Commonwealth Saga, Book 1) by Peter F. Hamilton. What I found fascinating is the universe painted by Hamilton. A future where human beings live forever, travel between stars instantaneously, transform their bodies at will, extend their minds, and so on. If it sounds like utopia, it is! Except that a small group of terrorists are creating trouble, claiming that unseen aliens are manipulating us. Clearly these terrorists are madmen. Or are they? After the conclusion of the three books of the Commonwealth Saga, the story continues with the Void Trilogy, 1,200 years later. I will be waiting for the conclusion in The Evolutionary Void with high expectations. The quality of the writing is exceptional, the characters are compelling. Even the aliens are original, and that’s quite a feat.
    • I am not a big fan of David Weber. He is a good writer, but his stories are full of holes and shallow. Yet, his Safehold series is engaging. I had great fun with Off Armageddon Reef (Book 1 of the series). The fancy and unbelievable science-fiction plot is an excuse to relive through the construction of the British Empire. If you like to think about how technological innovations come about and you like military history, this is a great series. Do not expect spaceships or laser cannons.

    Open Access: just for articles!

    Many funding agencies and some universities require researchers to publish their articles as open access. That is, research articles must be available to all, freely. The main argument in favor of these policies is social justice: why should publishers acquire the exclusive rights of work funded by students, governments and other benefactors?

    Professor Steven Shavell goes further: we should abolish copyright for academic work altogether. At first, I was confused: once your research articles are under open access, what more is there? Quite a lot, it turns out.

    You see, these compulsory Open Access policies target exclusively research articles published in journals. These policies exclude: books, book chapters, conference proceedings, reports and so on.

    Why? My colleague and Open Access leader Stevan Harnad explains:

    Books are still largely preferred by users in analog form, not digital-only—journal articles are increasingly sought and used in digital form, (…) It is not clear that for most or even many authors of “academic works” (…) the sole “benefit” sought is scholarly uptake and impact (…), rather than also the hope of some royalty revenue

    Using the economic model as an argument is the stance of publishers resisting Open Access. To them, we answered: your financial well-being is not our concern. Yet, a researcher writing a book—with funding from the government—should be allowed to ignore the ideals of Open Access for his own profit? Hardly fair, I say! Anyhow, I have written tens of articles in conference proceedings and I have edited a couple of books but I never received a penny. Professors do not edit books or write book chapters for profit.

    Consider the larger picture. Would our arguments change if paper books were replaced by Amazon or Sony ebook readers? And what happens if publishers start paying authors for research articles? Would our arguments in favor of Open Access melt? Yes, paper is expensive, but Open Access policies do not prevent publishers from charging anything they like! They just have to make sure it is cheaper to buy the book than to print it locally.

    What is the real reason we want to exclude books, book chapters and proceedings from Open Access policies? For-profit non-academic publisher already make available ebooks for free. What is our real excuse?

    Note: I hold no grudge against the big publishers such as Springer.

    Further reading: Is Open Access publishing the solution? Really?

    A recipe for interesting Computer Science research papers

    In Are your research papers telling original stories?, I claimed that the main benefits of the typical research paper were that:

    • the contribution to the state-of-the-art is clear (what did you invent?);
    • we can quickly quantify the value of the contribution (how well does it work?).

    Basically, research papers are fitted to the needs of the current peer review system.

    The current breed of research papers are also convenient. There are millions of ways of improving any given process. Each improvement can become a research paper. You can even proceed systematically. Pick any given solution to a problem and add a twist to it. Can you solve the problem faster? Can you solve the problem by using less memory? Can you solve the problem incrementally? And so on. You can manufacture countless research papers without ever learning anything new. And because you measured and categorized all of your contributions, you are even likely to get much recognition! Moreover, because you invented many new things, you may even get your name on a framework, algorithm or problem! If any of what you did is useful for industry, you may even get rich!

    But I may not find your work interesting. I would like to propose an alternative recipe that should produce more interesting research papers:

    • Pick any process followed by practitioners or by nature. How do human beings or ants solve a given problem? What heuristics do successful engineers follow?
    • Explain, model or reproduce the process in question.

    There are endless puzzles out there. For example, I have no satisfactory explanation of why wikipedia worked. Had I been asked about a project like wikipedia in the nineties, I would have predicted failure. Admit it: you would have done the same. Yet, it worked. Why?

    Look at how the best programmers work. They have many clever tricks (algorithms, processes, strategies) that you will never find in any textbook. Sometimes these tricks work unreasonably well. But we have no explanation.

    Remember: Nature is the best coauthor.

    Further reading: Write good papers.

    Scientists and their emotions

    Science is not a matter of pure logic. Some of the best scientists lived through intense emotions—it shaped their lifes. Here are a few quotes:

    • Ludwig Boltzman (invented entropy): When he could not reach the standards he set for himself, he would be overcome by feelings of fear, suffering and depression. He killed himself on 5 September 1906.
    • Wallace Carothers: Despite his success with Nylon, he felt that he had not accomplished much. He checked into a Philadelphia hotel room and died after drinking a cocktail of lemon juice laced with potassium cyanide.
    • Rudolf Diesel: His family says that Diesel committed suicide because his invention was stolen (so he felt). A cross in his journal on the date he died was an indicator of suicide.
    • Ignaz Semmelweis: He was one of the first to understand that doctors should wash their hands, especially after touching corpses before handling pregnant women. Yet he was largely ignored despite his great results and even ridiculed as naive. He suffered from severe depression and died in an asylum.

    Related posts: Research productivity: some paths less travelled and Emotions killing your intellectual productivity.

    Update: Don’t worry about my own state of mind. I’m not depressed right now, but I am interested in how our emotions shape our research and our faculty to innovate.

    Do hash tables work in constant time?

    Theory in Computer Science—as in any other field—is based on models. These models make many hidden assumptions. This is one of the fundamental reason why pure theory is wasteful. We must constantly revisit our old assumptions and run experiments to determine whether our models are useful.

    Hash tables are a fundamental data structure. They are part of every elementary introduction to Computer Science. From a programmer’s point of view, they associate keys with values. Hash tables go back to the beginnings of Computer Science even though new variations keep being invented (e.g., Cuckoo hashing). Incredibly, hash tables were just recently integrated in the C++ language, but less conservative languages like Java, Python, Ruby, Perl, PHP, all have hash tables built-in.

    The secret sauce is that the key is first transformed (or hashed) to a number between 1 and m corresponding to a cell in an array. As long as this number of generated with sufficient randomness, hash tables are almost equivalent to looking up values in an array—except for the overhead of hashing the key.

    Thus, everyone knows that hash table queries run in amortized constant time. That is, as the number of keys increases, the average time necessary to recover a key-value pair does not increase.

    But wait! How do we compute hash values? Consider the standard universal hashing technique described in Corman et al. Introduction to Algorithms. We pick any prime number m. Pick randomly a number a in [0,m). Then h(x)= x a modulo m is a universal hash function. Multiplications of numbers in [0,m) can almost be done in time O(log m).  For the purposes of hash tables, the number m must be close to the number of different keys n. Thus—formally—hash tables often run in time O(log n).

    Am I being pedantic? Does the time required to multiply integers on modern machine depend on the size of the integers? It certainly does if you are using vectorization. And vectorization is used in commercial databases!

    And what if your key is a string? As you have more keys, you are more likely to encounter longer strings which take longer to hash.

    Thus, it is possible for hash tables to run in time O(log n) in practice, despite what the textbooks say.

    What other hidden assumptions are you making right now?

    Note 1: Fortunately, universal hashing can be computed in linear time with respect to the number of bits, by avoiding multiplications.

    Note 2: The problems that result from picking the wrong model of computation are well known and addressed by most textbooks. I have not discovered anything new.

    Note 3: In a recent paper on fast string hashing, we show that, in practice, you can hash strings for a fraction of a CPU cycles per byte.

    Update: See my more recent post Sensible hashing of variable-length strings is impossible.

    Picking a web page at random on the Web

    To do statistics over the Web, we need samples. Thus, I want to know how to pick a Web page at random, without making much effort. If you are Google or Microsoft, it is easy. But what about the rest of us? And what if I want to pick users at random on Facebook?

    In effect, I want to sample a virtually infinite data set: I consider the Web to be infinite because I cannot enumerate all elements in it. Given finite resources against an infinite data set, can I sample fairly?

    Thankfully, the objects in this set are linked and indexed. Hence, I came up with two sampling strategies:

    • Using the hyperlinks: Start with any Web page. Recursively follow hyperlinks until you have many Web pages. Pick a page at random in this set. (Mathematically, I am sampling nodes in an infinite graph.)
    • Using the index: Sampling items Pick a set of common words, enter them into a search engine. Pick a page at random in the result set.

    Are there any other strategies? What can I say (if anything) about the biases of my sampling?

    Further reading: Benoit, D. and Slauenwhite, D. and Schofield, N. and Trudel, A., World’s First Class C Web Census: The First Step in a Complete Census of the Web, Journal of Networks 2 (2), 2007.

    Update: An anonymous reader points me to Ziv Bar-Yossef and Maxim Gurevich, Random sampling from a search engine’s index, JACM 55 (5), 2008.

    Update 2: Chris Brew points me to Henzinger et al., On Near-Uniform URL Sampling, WWW 2000.

    Update 3: Regarding Facebook, I found Gjoka et al., Unbiased Sampling of Facebook, 2009.

    A review of “Hello World: Computer Programming for Kids and Other Beginners”

    I learned programming on my own when I was twelve years old with a TRS-80 and Microsoft Basic. The documentation that came with the TRS-80 was fantastic. Alas, today, no vendor would ever think of including an introduction to programming with a computer. If your are a dad (or a mom) and you regret this, then Hello World: Computer Programming for Kids and Other Beginners by Warren and Carter Sande is for you. The book is cheap: while the list price is $34, Amazon sells it for a little over $23.

    As a kid, my initial goal was to create my own video games. And I did! I learned (entirely on my own) that:

    • Even though computers are fast, it is hard to create fast software. Being clever is hard work!
    • While it is easy to program a small game, programming a slightly more complex game can be orders of magnitude more difficult.

    This book should allow your kids to learn as much and more.

    Instead of using Microsoft Basic, the authors use Python. An excellent choice. They cover all the same material as in my venerable TRS-80 documentation: random numbers, variables, loops, graphics, functions and sound. However, there are a few more advanced concepts: dynamic arrays, arrays of arrays, objects, modules, file input and output, and event-based programming.

    Note that many of the examples of the book would not run on the latest Python release (3.1). That is a minor concern since I would recommend you stay away from Python 3 in any case.

    Disclosure: I was given a free e-book in exchange for my review.

    Why I hardly ever blog about my ongoing research

    When I started my blog in 2004, my goal was to blog about my research. It never happened.

    You may think that I am afraid a reader could steal my ideas, or that I might worry about looking silly. But I have no such fear. However, I am afraid it could hurt the quality of my work. I need a sandbox for my research. The purpose of the sandbox is to protect my ideas from the pressure:

    What scientists really need is more time and more freedom to play with their ideas without pressure to fit in, to publish, to make up their mind. (Backreaction)

    Working without pressure on your own ideas is essential for science. By opening up too early on your research directions, you can face at least two problems:

    • You commit to ideas before you had a chance of fighting self-delusion. Psychologists will tell you that once you write down and explain a position, you tend to stick with it. It is irrational, but true. I claim that this effect is less significant if you sketch ideas in your own private sandbox.
    • Early on, your ideas are still fragile: an expert can destroy them by apparently reasonable criticism before you had a chance to root them into a solid framework.

    In the early stage of my research, I am the best person to hold a critical view on my own work. Only later, when I have had the chance to explore the idea to my satisfaction, do I need the criticism of others.

    This is not just a theory. Several years ago, I shared an idea with a prominent scientist. He generously replied—in details—about why my idea was wrong. As part of his argument, he made a specific assertion which, if true, made my work uninteresting. A year later, I finally revisited the idea, and I determined that he was wrong, in a self-serving way.

    Silence and long hours alone in front of a desk are necessary for strong original ideas.

    Further reading: The notion of disputation arenas by David Brin.

    Netflix game gets exciting: BellKor’s Pragmatic Chaos is passed by The Ensemble

    This is fun. A month ago, I asked whether the Netflix competition was over. After BellKor’s Pragmatic Chaos merged the solutions of many players to break the 10% barrier, I expected them to win. It turns out that another coalition was created—The Ensemble—and they have beaten the previous best score by 0.01%. There are only a short few hours before the final deadline, so it is still impossible to know who will win. However, if The Ensemble wins, it will be very frustrating to BellKor’s Pragmatic Chaos: The Ensemble is a last minute player.

    Both teams broke the 10% barrier by using a diverse coalition, by merging several different ideas. As Peter Turney recently stated:

    There are no whole-truths, but we can get by reasonably well with a large number of half-truths.

    I am now more convinced than ever that science needs diverse explanations, techniques and opinions. We should actively reward creativity. Science is not merely about truth-seeking.