I get ten to fifteen questions a week on recommender systems from entrepreneurs and engineers. Sometimes, I help people find their way in the literature. On occasion, for a consulting fee, I get my hands dirty and evaluate, design or code specific algorithms. But mostly, I answer the same questions again and again:

1. How much data do I need?

Given your data, you can use cross-validation or A/B testing to objectively measure the effectiveness of a recommender system.
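To make the cross-validation idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the (user, item, rating) tuples are toy data, the two predictors are deliberately naive, and mean absolute error is just one possible metric. It only shows the mechanics of holding out part of the data and scoring predictions on it.

```python
import random
from collections import defaultdict


def global_mean_predictor(train):
    """Baseline: always predict the mean of all training ratings."""
    mean = sum(r for _, _, r in train) / len(train)
    return lambda user, item: mean


def item_mean_predictor(train):
    """Predict each item's mean training rating (global mean if unseen)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, rating in train:
        sums[item] += rating
        counts[item] += 1
    fallback = sum(sums.values()) / len(train)
    return lambda user, item: (
        sums[item] / counts[item] if item in counts else fallback
    )


def cross_validated_mae(ratings, build_predictor, k=5, seed=0):
    """Average the mean absolute error over k held-out folds."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    fold_errors = []
    for fold in range(k):
        # Every k-th rating (offset by the fold index) is held out for testing.
        test = shuffled[fold::k]
        train = [r for i, r in enumerate(shuffled) if i % k != fold]
        predict = build_predictor(train)
        errors = [abs(predict(u, i) - r) for u, i, r in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Hypothetical toy data: 20 users rate 10 items, and the rating depends
# only on the item, so the item-mean predictor should do very well.
ratings = [(user, item, 1.0 + item % 5) for user in range(20) for item in range(10)]
baseline_mae = cross_validated_mae(ratings, global_mean_predictor)
item_mae = cross_validated_mae(ratings, item_mean_predictor)
print("global mean MAE:", baseline_mae)
print("item mean MAE:", item_mae)
```

If the error on held-out data stops improving as you feed the system more of your data, that is a hint that more data of the same kind will not help much.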

2. We have this system in place, how do we know whether it is sane?

See previous question.

3. My online recommender system is slow!

Laziness is your friend: don’t recompute the recommendations each time you have new data.
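One way to be lazy, sketched in Python under assumptions of my own: the class name LazyRecommender, the refresh_after threshold and the "most popular items" computation are all hypothetical stand-ins. New data is recorded cheaply, and the expensive computation only runs when someone asks for recommendations and enough new data has piled up.

```python
from collections import Counter


def top_items(events, n=3):
    """Stand-in for an expensive computation: the n most frequent items."""
    return [item for item, _ in Counter(events).most_common(n)]


class LazyRecommender:
    """Serve cached recommendations; recompute only when they are stale."""

    def __init__(self, compute, refresh_after=100):
        self._compute = compute              # expensive function of all events
        self._refresh_after = refresh_after  # new events tolerated before refresh
        self._events = []
        self._pending = 0
        self._cache = None
        self.recomputations = 0              # exposed to observe the laziness

    def add_event(self, event):
        # Recording new data is cheap: nothing is recomputed here.
        self._events.append(event)
        self._pending += 1

    def recommend(self):
        # Recompute on the first request, or once enough new data has arrived.
        if self._cache is None or self._pending >= self._refresh_after:
            self._cache = self._compute(self._events)
            self._pending = 0
            self.recomputations += 1
        return self._cache
```

With refresh_after=3, for instance, two new events leave the cached answer untouched, and the third triggers a single recomputation on the next request. A time-based expiry would work just as well; the threshold is only one possible staleness rule.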

4. My customers don’t like the recommendations!

  • Keep expectations in check: recommending products is difficult and even human beings have trouble doing it,
  • Explain the recommendations: nobody trusts a black box,
  • Allow your users to freely explore your data and products in convenient and exciting ways.

5. Which algorithm is best?

You should start with simple algorithms: they worked well enough for Amazon. To do better, a mix of different algorithms is probably best. You can combine them using ensemble methods.
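As an illustration of the simplest kind of ensemble, here is a hedged Python sketch that blends two predictors with a weighted average and picks the weight on held-out data. The function names and the grid search are my own assumptions, not a description of any particular production system; real ensembles are often fitted by regression or boosting instead.

```python
def blend(predictors, weights):
    """Combine predictors into one by taking a weighted sum of their outputs."""
    def predict(user, item):
        return sum(w * p(user, item) for p, w in zip(predictors, weights))
    return predict


def best_blend_weight(first, second, validation, steps=10):
    """Grid-search w in [0, 1] so that w*first + (1-w)*second has the
    smallest squared error on held-out (user, item, rating) tuples."""
    def squared_error(w):
        return sum(
            (w * first(u, i) + (1 - w) * second(u, i) - r) ** 2
            for u, i, r in validation
        )
    return min((s / steps for s in range(steps + 1)), key=squared_error)


# Hypothetical predictors: one is too pessimistic, the other too optimistic.
def pessimist(user, item):
    return 2.0


def optimist(user, item):
    return 4.0


validation = [(0, 0, 3.0), (1, 0, 3.0), (2, 1, 3.0)]
w = best_blend_weight(pessimist, optimist, validation)
combined = blend([pessimist, optimist], [w, 1 - w])
print("weight:", w, "blended prediction:", combined(0, 0))
```

The weight must be chosen on data that neither predictor was trained on, otherwise the blend simply rewards whichever predictor overfits the most.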

One of the upsides of working for a university is the stimulating academic discussions. Yesterday, a philosopher challenged me with a question:

Beyond the fact that software is expressed in mathematical artefacts (bits, algorithms), are Information Systems fundamentally Mathematical?

For my convenience, I temporarily rephrase the question to something simpler and more concrete:

How are Software Developers limited by their mathematical weaknesses?

I plan several blog posts around this question, but let me start with an example.

A common and powerful language to process XML is XPath. XPath is used within web applications, scripts, databases, and so on. I often ask students the following question about XPath. Are these two expressions equivalent?

$x="some string"

and

not($x!="some string").

(The symbol “!=” means “different from”.)

Invariably, most students conclude that they are equivalent. Wrong!

Let us examine the semantics.

  • The expression $x="some string" means that at least one element of $x is equal to "some string".
  • The expression $x!="some string" means that some element of $x is different from "some string".
  • The negation of $x!="some string" is that all elements of $x are equal to "some string". (Sorry if it sounds confusing.)

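Thus, when $x is not empty, the expression not($x!="some string") is a more restrictive condition than the expression $x="some string". (When $x is empty, no element can be different from "some string", so not($x!="some string") is true while $x="some string" is false.)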
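For readers who prefer code to prose, here is a small Python model of these semantics. It is not an XPath engine: the helper names xpath_eq and xpath_ne are hypothetical, and the lists of strings merely stand in for node-sets. It only encodes the rule that a comparison between a node-set and a string holds when at least one node satisfies it.

```python
def xpath_eq(nodes, value):
    """Model of $x = value: true if at least one node equals value."""
    return any(node == value for node in nodes)


def xpath_ne(nodes, value):
    """Model of $x != value: true if at least one node differs from value."""
    return any(node != value for node in nodes)


target = "some string"
cases = {
    "single match": [target],
    "mixed": [target, "other"],
    "empty": [],
}
for label, nodes in cases.items():
    # Columns: $x = "some string"   versus   not($x != "some string")
    print(label, xpath_eq(nodes, target), not xpath_ne(nodes, target))
```

The two expressions agree on the single-match case and disagree on the other two, which is exactly why the "obvious" equivalence fails.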

Great software developers routinely think through far more complex mathematical problems. Yet, they do not think of them as being Mathematics.

Almost all software I write for my research is open sourced. A fellow researcher argued today that I risk reducing the gap between me and my pursuers. Similarly, by this argument, I should keep my data to myself (and avoid listing good sources of research data).

Here is my take on this issue.

  1. Sharing can’t hurt the small fish. Almost nobody sets out to beat Daniel Lemire at some conference next year. I have no pursuer. And guess what? You probably don’t. But if you do, you are probably doing quite well already, so stop worrying. Yes, yes, they will give you a grant even if you don’t actively sabotage your competitors. Relax already!
  2. Sharing your code makes you more convincing. By making your work easier to reproduce, you are instantly more credible. Trust is important in science. Why would anyone trust that I actually wrote the code and ran the experiments? Because I published my code, that’s why!
  3. Source code helps spread your ideas faster. In the long run, you should not care about getting papers accepted at some hot conference. What matters is the impact you have had. Make it easy for me to use your ideas! Help yourself!
  4. Sharing raises your profile in industry. Having open source software makes you more attractive to software engineers.
  5. You write better software if you share it. While not all code I publish is bug-free, documented or even usable, I care slightly more about my code because I publish it.

Finally, does sharing code work? Do people download and use my software? Here are download statistics for my latest source-code publications:

  • A compressed alternative to the Java BitSet class: over 280 downloads
  • Rolling Hash C++ Library: over 200 downloads
  • Lemur Bitmap Index C++ Library: over 2,000 downloads
  • Fast Nearest-Neighbor Retrieval under the Dynamic Time Warping: over 1,400 downloads

Related reading: Good prototyping software and The challenge of doing good experimental work by Suresh Venkatasubramanian. And More on algorithms and implementation by Michael Mitzenmacher.

Update: Joachim Wuttke pointed out another potential benefit: your users will debug your code.

I am not opposed to the Publish or Perish mantra. I am an academic writer. I am what I publish. We all think of researchers as people wearing laboratory coats, working on exotic devices. And my own laboratory includes a one-million-dollar computer cluster with a SAN server as large as a fridge. I also generate much software. But you know what? The writing is what matters.

And publishing is easy. Write and submit many papers conforming to the expectations of the editors. Eventually, some of your work will be accepted. And there are thousands of journals, conferences and workshops. Just write a lot.

Yet, don’t publish everything you write, even when what you wrote looks like a research paper. Hold on to it, because publishing everything that looks like a research paper leads to what Feynman famously described as Cargo Cult Science. Indeed, there is a real danger that we become so good at faking science that we are no longer doing science at all! We become dishonest.

In our haste to be published…

  • we cut corners in our experiments, when we validate our ideas at all;
  • we pretend that our work is applicable in the real world, when it isn’t;
  • we don’t take the time to reproduce and reflect on known results;
  • we give the positive aspects of our research while omitting to mention the negatives;
  • we complexify the issues so that our research looks fancier;
  • we get lost in abstract nonsense.

If you want your work to really matter, you should be honest. You should not fool yourself and others. So what do we do? Maybe we should publish carefully. While barely reducing our output rate as academic writers, we can introduce extra steps to keep us more honest. What do we need?

  • Diverse points of view: it is easy to fool a small group of like-minded experts, but comparatively more difficult to fool the readers of my blog.
  • Time to reflect: if you read what you wrote months ago, and you don’t feel the urgency to communicate it more broadly, maybe it wasn’t all that good to begin with?

The problem is that once a paper is published in a journal or a conference, we tend to move on. Anyhow, we cannot easily revise our published work. Are there other models? Economists regularly publish working papers—commonly known in Computer Science as technical reports. But the difference between computer scientists and economists is that economists revise their working papers. And only when their work has stood the test of time, that is, has been available freely for months or years, do they submit it to conventional peer review.

This year, I will try the following experiment. Both on this blog and on my publication page, I will “publish” working papers and specifically ask readers to be critical of my work. Only after a couple of months have passed (or more) will I submit my work to a journal or conference.

This will introduce some latency in my publication output. Can I trade latency for quality? I plan to report back in a year on this (very public) experiment.

Further reading: Time for computer science to grow up by Lance Fortnow.

If you read my blog, you probably like to read in general. Thus, if you don’t own an ebook device, you will soon. The choice is growing: the Amazon Kindle, the Sony Reader, the Apple iPad, and so on. I bought a Kindle because my wife won’t let me fill the house with books. And I hate to throw away perfectly good paper books.

Amazon has most of the market for now. Yet using the Kindle store on the Kindle itself is painful. Moreover, Amazon ebooks are protected by Digital Rights Management (DRM). Amazon sells you crippled ebooks that can stop working if you copy them too often. There are often better alternatives elsewhere.

And, in Canada, there is a two-dollar surcharge for every wireless download using the Kindle. Since most ebooks are 0.5 MB or less, the wireless delivery costs at least $4 per megabyte! This is insulting! Moreover, if you buy a book by mistake (which is annoyingly common), Amazon will reimburse the cost of the book itself, but not the fee for the wireless download.

Thankfully, you can grab books compatible with the Kindle (in Mobipocket format) elsewhere. Then you can drop the file on the Kindle using the USB port.

  • You can get nearly 2,000 of the great French classics for free on ebookgratuits. This includes a large fraction of the works of Honoré de Balzac.
  • Project Gutenberg offers 30,000 free e-books in various languages (mostly English).
  • WebScription sells DRM-free ebooks in various formats. Most books fall into the scifi, young adults and fantasy genres.

I am currently reading You’re Not Fooling Anyone When You Take Your Laptop to a Coffee Shop by Scalzi. I bought it at WebScription for six dollars. It is a compilation of Scalzi’s blog posts on his life as a writer. I am fascinated by how much it resembles my own life. Well… Except for the fact that I don’t get paid when I publish a paper. Maybe I should put together a compilation of posts about my silly work life. Would anyone buy it for six dollars?

I am also reading Halting State by Stross, which I bought on Amazon for ten dollars. I haven’t yet gotten into the mood of the novel.

Further reading:

  • According to a Computer Scientist, the iPad could make Computer Science obsolete.
  • While I don’t think academic journals will be available on the Kindle any time soon, I think that has mostly to do with how insane academic publishing is.
