The paperless campus: still a long way to go

Today I spent money from a research grant. Here is the process:

  1. I grab the form in Excel format.
  2. I fill it out.
  3. I print the form.
  4. I sign it.
  5. I give it to a secretary.
  6. The secretary gets the chair of my research center to sign it.
  7. The form is then sent to accounting, by internal mail (on paper).
  8. They review the form.
  9. They enter the data in a computerized database.

Suggesting that it could be improved using technology from this century is crazy talk.

So, you know what’s important?

Most researchers are convinced that their current work is important. Otherwise, they wouldn’t do it. Yet, few of them work on obviously important things like curing cancer or solving world hunger. Rather, they do silly things like prove the Poincaré conjecture. A century to figure out some theoretical nonsense? Please!

So, why won’t researchers work on the important problems of our era?

The conventional explanation is that working directly on the major problems is like staring at the Sun. Instead, researchers must do routine work until an opening toward greatness appears. So real researchers…

  • survey existing work,
  • comment on special cases,
  • provide theoretical justifications for empirical observations,
  • validate new theory experimentally, and so on.

That is, researchers are not architects. They use greedy algorithms:

  • Look at problems and results you can grasp with your current expertise.
  • Select an interesting problem which is a good fit for your expertise.
  • Rinse and repeat.
  • Wait! You are close to solving a major problem? Jump on it!

Should scientists feel guilty that they can’t prove the importance of each increment? I think not. I think scientists are inefficient, but there is no better way known to man. Indeed, consider how real innovation is typically unpredictable:

  • The greatest difference between my Honda Civic and the car I drove as a teenager is that I can lock and unlock all doors with a remote. This single function made all the difference in the world for me. I drive my wife nuts as I keep playing with the remote: lock, unlock, lock, unlock… And people thought we would have flying cars!
  • I am sure that Google offered better search results than AltaVista. Yet, the real reason I switched to Google and never looked back is that they did away with the big annoying ads. Understanding that you didn’t need to annoy your users to make a profit was Google’s greatest innovation. (Don’t let them fool you into thinking PageRank had something to do with it.)
  • Amazon.com is by far the best e-commerce site on the planet. But what is different? In fact, a lot of little things. On the surface, Amazon.com is just an HTML view of a database. But they have collected many small innovations that, when put together, make a huge difference.

My point is that innovation in the little things adds up to important and practical results. That is why academic researchers spend so much time writing surveys or studying a detail to death. They don’t think their own work will change the world, but they count on others doing the same thing. They hope that when everything is put back together, the result is great. For the last two hundred years or so, they have fared extremely well.

To put it another way: greedy algorithms can be pretty good. They can certainly beat 5-year plans.

Further reading: Innovative ideas are indistinguishable from crackpot ones

External-memory shuffling in linear time?

You can sort large files while using little memory. The Unix sort tool is a widely available implementation of this idea. Files are written to disk sequentially, without random access. Thus, you can also sort variable-length records, such as lines of text.

What about shuffling? Using the Fisher-Yates shuffle, also known as Knuth’s shuffle, you can shuffle large files while using almost no memory. But you need random access to your files, so the method is not applicable to variable-length records. And indeed, the Unix sort command cannot shuffle. (It has a random-sort option, but it is not a shuffle. Meanwhile, the shuf command runs in RAM.)
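As a refresher, here is a minimal sketch of the Fisher-Yates shuffle in Python. Note the indexing at random positions: this is exactly the random access we do not have when variable-length records are stored sequentially in a file.

import random

def fisher_yates(items):
    # In-place shuffle: every permutation is equally likely.
    for i in range(len(items) - 1, 0, -1):
        j = random.randint(0, i)  # random index in [0, i]: requires random access
        items[i], items[j] = items[j], items[i]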

A solution: Tag each record with a random number. Pick random numbers from a very large set so that the probability that any two lines have the same random number is small. Then use external-memory sorting. You can implement something similar as a single line in Unix.
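As a sketch, the tagging step might look as follows in Python, assuming one record per line on standard input; you would then pipe the output through an external-memory sort (the Unix sort command, say) and strip the tags.

import random
import sys

# Prefix each line with a random 64-bit key: with so many possible keys,
# a collision between any two lines is very unlikely.
for line in sys.stdin:
    sys.stdout.write(f"{random.getrandbits(64):020d}\t{line}")

Sorting on the first field and then discarding it yields the shuffled file.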

A better solution? Shuffling is possible in linear time O(n). Sorting is a harder problem (in O(n log n)). Thus, using a sort algorithm for shuffling, as we just did, is inelegant. Can we shuffle variable-length records in linear time, without random access?

Maybe we could try something concrete? Consider this algorithm (a code sketch follows the list):

  • Create N temporary files, choosing N large enough that 1/N of your entire data set is likely to fit in RAM.
  • Assign each string to one temporary file at random.
  • Shuffle the temporary files in RAM.
  • Concatenate the temporary files.
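Here is a minimal Python sketch of this algorithm, assuming one record per line; the temporary file names and the default value of N are arbitrary choices.

import random

def external_shuffle(input_path, output_path, n_buckets=100):
    # Steps 1 and 2: send each line to one of N temporary files at random.
    buckets = [open(f"bucket_{i}.tmp", "w") for i in range(n_buckets)]
    with open(input_path) as f:
        for line in f:
            random.choice(buckets).write(line)
    for b in buckets:
        b.close()
    # Steps 3 and 4: shuffle each temporary file in RAM, then concatenate.
    with open(output_path, "w") as out:
        for i in range(n_buckets):
            with open(f"bucket_{i}.tmp") as b:
                lines = b.readlines()
            random.shuffle(lines)
            out.writelines(lines)

Every pass over the data is sequential, so the expected running time is linear in the size of the input.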

Something similar was described by P. Sanders in Random Permutations on Distributed, External and Hierarchical Memory (Information Processing Letters, 1998). See also the earlier work by Sandelius (A simple randomization procedure, 1962) as well as Rao (Generation of random permutation of given number of elements using random sampling numbers, 1961).

Which is faster: integer addition or XOR?

The bitwise exclusive or (e.g., 1110 XOR 1001 = 0111) looks simpler to compute than integer addition (e.g., 2 + 9 = 11). Some research articles claim that XOR is faster. It appears to be Computer Science folklore. But is it true?

Which line runs faster? (The symbol “^” is the XOR.)

for(int k = 0; k < N; ++k) sum += k;

for(int k = 0; k < N; ++k) sum ^= k;

My result: In C++ and Java, both run at the same speed (within 1%).

Code: My source code is on GitHub.
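For a quick sanity check without compiling anything, here is a rough sketch of the same measurement using Python’s timeit module. Keep in mind that interpreter overhead dominates in Python, so this mostly confirms that neither operation is a bottleneck.

import timeit

N = 1_000_000

def add_loop():
    s = 0
    for k in range(N):
        s += k
    return s

def xor_loop():
    s = 0
    for k in range(N):
        s ^= k
    return s

print("addition:", timeit.timeit(add_loop, number=10))
print("xor:", timeit.timeit(xor_loop, number=10))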

Language, Mathematics and Programming

Even if you have extensive training in Mathematics, the average Mathematics paper is indistinguishable from the ramblings of a madman. Many of these papers seek to solve narrow problems. And yet, we respect Mathematicians.

Software programming is a form of communication, usually between human beings and machines. While different in style, programming is a subset of the language of Mathematics. If you dig into the average source code, it is indistinguishable from ramblings, even if you are an expert developer.

Yet, we denigrate programming. Many will even deny that it is a Mathematical language. But Mathematics and Programming are not so different:

  • Mathematics: Building on the previous research papers requires you to dig through endless piles of boring, badly written research papers. Programming: Maintaining millions of lines of code written by various people over the years is difficult, boring and error-prone.
  • Mathematics: Inventing new theorems or new mathematical theories requires much creativity. Programming: Coming up with the next best iPhone application requires much creativity.
  • Mathematics: For most people, mastering even part of Mathematics requires a decade or more. Programming: Please read Teach Yourself Programming in Ten Years by Peter Norvig.
  • Mathematics: The language of Mathematics has directly contributed to technological progress: electricity, engines, nuclear power and space travel all required extensive use of Mathematics. Programming: Google changed the world through the brilliance of its software engineers, and the open source revolution has changed how people think about collaboration.
  • Mathematics: Some Mathematicians are widely recognized as being extremely smart. Programming: Some famous people have done their fair share of difficult and technical programming: Donald Knuth and TeX, Tim Berners-Lee and the Web, Linus Torvalds and Linux.

Why is programming getting so little respect?

  • The intense commercialization of programming has commoditized it. As Paul Graham might say: painters were initially “portrait takers”. It is only when painting lost its commercial function that it became recognized as a noble art. However, just like painters always used their free time to create great art, the best programmers are open sourcing beautiful code all the time.
  • The study of programming itself remains rather informal. You can get degrees in Computer Science, Computer Engineering or Software Engineering, but there is no degree in Programming. Programming is taught in universities, but generally only in the first few courses of a degree. Yet, there are degrees in Communication, Fine Art, Architecture, Music or Dance. While a degree in Computer Science or Software Engineering can make you a better programmer, the fact remains that your professors are not expert practitioners.

How can we fix this? I have this secret dream of setting up the equivalent of a “Creative Writing” program, but for programmers. Call it “Creative Programming”. Basically, students would come together to write great code. Yes, such code might be useful commercially, but that would be a secondary consideration. The pursuit of greatness would be the only goal that matters. It would treat programming as a bona fide language. It would attract the best programmers as guest lecturers. Would this ever work out? I do not know.

I am sure that many will point out that my secret dream is impractical. Beauty should not come first: we want cheap, reliable, maintainable code. We also want programmers to be replaceable, inexpensive and practical. However, human beings can pursue greatness while being practical. Compromise is possible.

Let me conclude by quoting Donald Knuth:

(…) computer programming is an art, because it applies accumulated knowledge to the world, because it requires skill and ingenuity, and especially because it produces objects of beauty. A programmer who subconsciously views himself as an artist will enjoy what he does and will do it better.

Further reading: The best software developers are great at Mathematics? and Is programming “technical”?

Who the heck got Universities into the email business?

My current employer, UQAM, refuses to allow email forwarding. Many students would rather forward their emails to their existing GMail accounts, for example. And the IT Department (the SITEL) agrees that forwarding would have several benefits. However, they refuse to allow it for the following reasons:

  • Email forwarding may create infinite email loops. These may disrupt services and require human intervention.
  • Invalid or failing remote servers may saturate the local servers as they are unable to forward the emails.
  • Professors and management send confidential information by email. Yet, without full control of the email service, the University cannot ensure the needed confidentiality.
  • With email forwarding, it may be impossible to ensure and prove that an email was received and read. Thus, homework assignments, administrative inquiries or security advisories may never reach the students, or we may be unable to prove that they reached the students.
  • As a Canadian University, we run the risk that forwarded emails transit through American servers, where Canadian privacy law does not apply.
  • Email forwarding may put students at risk if remote accounts are stolen or lost.

Can you help me debunk or mitigate these arguments? I know that some of these arguments are bogus, but I am looking for solid references. (Not that I expect to change their mind.)

A larger issue: shouldn’t universities stick with research and teaching? I understand that we must have networks, cables, computers, firewalls, but do we need to provide our students with email services?

Update: Turns out that our IT people encourage students who want forwarding to GMail (say) to use the POP3 protocol. It is unclear to me how email forwarding can be a dangerous practice whereas POP3 “forwarding” can be safe.

Is programming “technical”?

According to student evaluations, most of my students appreciate short programming assignments. Yet, every year, some students think that programming is below them or unimportant.

Maybe I should start my courses with this theorem:

Theorem. If you understand an idea, you can implement it in software.

There is no denying that programming requires a lot of technical knowledge. Most programmers do technical jobs, involving testing, building or refactoring code. But programming is ultimately a form of communication. And it is as noble as Mathematics or English. Let us compare:

  • Writers are considered sexy and non-technical people. Yet, grammar and spelling are technical. Moreover, most writers earn a living by writing ads for boring products. Some of them make a living with grand novels, but fewer than you think.
  • Physicists are great thinkers. Yet, their mathematical derivations are often mind-numbing and technical. Many physicists spend years running extremely technical experiments. And when they don’t, they program extremely complex (and technical) simulations.

For some reason, being a writer is somehow considered more prestigious than being a programmer. If you ask me, Linus Torvalds is every bit as cool as J. K. Rowling. And I’d rather have a lunch date with Linus.

Most common questions about recommender systems…

I get ten to fifteen questions a week on recommender systems from entrepreneurs and engineers. Sometimes, I help people find their way in the literature. On occasion—for a consulting fee—I get my hands dirty and evaluate, design or code specific algorithms.  But mostly, I answer the same questions again and again:

1. How much data do I need?

There is no magic threshold. Given your data, you can use cross-validation or A/B testing to objectively measure the effectiveness of a recommender system.
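As a minimal sketch, a holdout test might look as follows in Python. The data layout and the build_model callback are hypothetical, not any specific library’s API.

import random

def holdout_rmse(ratings, build_model, test_fraction=0.2):
    # ratings: a list of (user, item, rating) triples.
    # build_model: trains on such triples, returns predict(user, item) -> rating.
    data = ratings[:]
    random.shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    train, test = data[:cut], data[cut:]
    predict = build_model(train)
    errors = [(predict(u, i) - r) ** 2 for (u, i, r) in test]
    return (sum(errors) / len(errors)) ** 0.5

A lower error on the held-out data is evidence that the system is learning something; repeat with several random splits for a more stable estimate.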

2. We have this system in place, how do we know whether it is sane?

See previous question.

3. My online recommender system is slow!

Laziness is your friend: don’t recompute the recommendations each time you have new data.

4. My customers don’t like the recommendations!

  • Keep expectations in check: recommending products is difficult and even human beings have trouble doing it,
  • Explain the recommendations: nobody trusts a black box,
  • Allow your users to freely explore your data and products in convenient and exciting ways.

5. Which algorithm is best?

You should start with simple algorithms: that worked well enough for Amazon. To do better, a mix of different algorithms is probably best. You can combine them using ensemble methods.
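As a sketch, the simplest ensemble is a weighted average of two scoring functions; the weight itself can be chosen by cross-validation. (The function names are illustrative, not from any particular library.)

def blend(score_a, score_b, weight=0.5):
    # Linear ensemble: combine two recommenders' scores into one.
    return lambda user, item: (weight * score_a(user, item)
                               + (1.0 - weight) * score_b(user, item))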

The best software developers are great at Mathematics?

One of the upsides of working for a university is the stimulating academic discussions. Yesterday, a philosopher challenged me with a question:

Beyond the fact that software is expressed with mathematical artefacts (bits, algorithms), are Information Systems fundamentally Mathematical?

For my convenience, I temporarily rephrase the question to something simpler and more concrete:

How are Software Developers limited by their mathematical weaknesses?

I plan several blog posts around this question, but let me start with an example.

A common and powerful language to process XML is XPath. XPath is used within web applications, scripts, databases, and so on. I often ask students the following question about XPath. Are these two expressions equivalent?

$x="some string"

and

not($x!="some string")

(The symbol “!=” means “different from”.)

Invariably, most students conclude that they are equivalent. Wrong!

Let us examine the semantics.

  • The expression $x="some string" means that at least one element of $x is equal to "some string".
  • The expression $x!="some string" means that some element of $x is different from "some string".
  • The negation of $x!="some string" is that all elements of $x are equal to "some string". (Sorry if it sounds confusing.)

Thus, when $x is nonempty, the expression not($x!="some string") is a more restrictive condition than the expression $x="some string". (And if $x is an empty node-set, both comparisons are false, so the negated expression is true while $x="some string" is false.)
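You can verify this behavior concretely. Here is a quick sketch using Python’s lxml library; the element names are arbitrary, and the v children play the role of the node-set $x.

from lxml import etree

root = etree.XML("<r><v>some string</v><v>other</v></r>")

# At least one v equals the string: True.
print(root.xpath('v = "some string"'))
# At least one v differs from the string: also True.
print(root.xpath('v != "some string"'))
# The negation of the previous expression: False, even though the first test was True.
print(root.xpath('not(v != "some string")'))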

Great software developers routinely think through far more complex mathematical problems. Yet, they do not think of them as being Mathematics.

Open Sourcing your software hurts your competitiveness as a researcher?

Almost all software I write for my research is open sourced. A fellow researcher argued today that I risk reducing the gap between me and my pursuers. By the same logic, I should keep my data to myself (and avoid listing good sources of research data).

Here is my take on this issue.

  1. Sharing can’t hurt the small fish. Almost nobody sets out to beat Daniel Lemire at some conference next year. I have no pursuer. And guess what? You probably don’t. But if you do, you are probably doing quite well already, so stop worrying. Yes, yes, they will give you a grant even if you don’t actively sabotage your competitors. Relax already!
  2. Sharing your code makes you more convincing. By making your work easier to reproduce, you are instantly more credible. Trust is important in science. Why would anyone trust that I actually wrote the code and ran the experiments? Because I published my code, that’s why!
  3. Source code helps spread your ideas faster. In the long run, you should not care about getting papers accepted at some hot conference. What matters is the impact you have had. Make it easy for me to use your ideas! Help yourself!
  4. Sharing raises your profile in industry. Having open source software makes you more attractive to software engineers.
  5. You write better software if you share it. While not all code I publish is bug-free, documented or even usable, I care slightly more about my code because I publish it.

Finally, does sharing code work? Do people download and use my software? Here are download statistics for my latest source-code publications:

  • A compressed alternative to the Java BitSet class: >280
  • Rolling Hash C++ Library: >200
  • Lemur Bitmap Index C++ Library: >2,000
  • Fast Nearest-Neighbor Retrieval under the Dynamic Time Warping: >1,400

Related reading: Good prototyping software and The challenge of doing good experimental work by Suresh Venkatasubramanian. And More on algorithms and implementation by Michael Mitzenmacher.

Update: Joachim Wuttke pointed out another potential benefit: your users will debug your code.

Update 2: This post appeared on slashdot.

Trading latency for quality in research

I am not opposed to the Publish or Perish mantra. I am an academic writer. I am what I publish. We all think of researchers as people wearing laboratory coats, working on exotic devices. And my own laboratory includes a one-million-dollar computer cluster with a SAN server as large as a fridge. I also generate much software. But you know what? The writing is what matters.

And publishing is easy. Write and submit many papers conforming to the expectations of the editors. Eventually, some of your work will be accepted. And there are thousands of journals, conferences and workshops. Just write a lot.

Yet, don’t publish everything you write—even when what you wrote looks like a research paper. Hold on to it, because publishing everything that looks like a research paper leads to what Feynman famously described as Cargo Cult Science. Indeed, there is a real danger that we become so good at faking science that we are no longer doing science at all! We become dishonest.

In our haste to be published…

  • we cut corners in our experiments, when we validate our ideas at all;
  • we pretend that our work is applicable in the real world, when it isn’t;
  • we don’t take the time to reproduce and reflect on known results;
  • we give the positive aspects of our research while omitting to mention the negatives;
  • we complexify the issues so that our research looks fancier;
  • we get lost in abstract nonsense.

If you want your work to really matter, you should be honest. You should not fool yourself and others. So what do we do? Maybe we should publish carefully. While barely reducing our output rate as academic writers, we can introduce extra steps to keep us more honest. What do we need?

  • Diverse points of view: it is easy to fool a small group of like-minded experts, but comparatively more difficult to fool the readers of my blog.
  • Time to reflect: if you read what you wrote months ago, and you don’t feel the urgency to communicate it more broadly, maybe it wasn’t all that good to begin with?

The problem is that once a paper is published in a journal or a conference, we tend to move on. Anyhow, we cannot easily revise our published work. Are there other models? Economists regularly publish working papers—commonly known in Computer Science as technical reports. But the difference between computer scientists and economists is that economists revise their working papers. And only when their work has stood the test of time, that is, has been available freely for months or years, do they submit it to conventional peer review.

This year, I will try the following experiment. Both on this blog and on my publication page, I will “publish” working papers and specifically ask readers to be critical of my work. Only after a couple of months have passed (or more) will I submit my work to a journal or conference.

This will introduce some latency in my publication output. Can I trade latency for quality? I plan to report back in a year on this (very public) experiment.

Further reading: Time for computer science to grow up by Lance Fortnow.

Where to get your ebooks?

If you read my blog, you probably like to read in general. Thus, if you don’t own an ebook device, you will soon. The choice is growing: the Amazon Kindle, the Sony Reader, the Apple iPad,… I bought a kindle because my wife won’t let me fill the house with books. And I hate to throw away perfectly good paper books.

Amazon has most of the market for now. Yet, using the kindle store—on the kindle—is painful. Moreover, Amazon ebooks are protected by Digital Rights Management (DRM). Amazon sells you crippled ebooks that can stop working if you copy them too often. There are often better alternatives elsewhere.

And, in Canada, there is a two-dollar surcharge for every wireless download using the Kindle. Since most ebooks are 0.5MB or less, the wireless costs $4 per megabyte! This is insulting! Moreover, if you buy a book by mistake—which is annoyingly common—Amazon will reimburse the cost of the book itself, but not the fee for the wireless download.

Thankfully, you can grab books compatible with the kindle (in Mobipocket format) elsewhere. Then you can drop the file on the kindle using the USB port.

  • You can get nearly 2,000 great French classics for free on ebookgratuits, including a large fraction of the work of Honoré de Balzac.
  • Project Gutenberg offers 30,000 free e-books in various languages (mostly English).
  • WebScription sells DRM-free ebooks in various formats. Most books fall into the sci-fi, young adult and fantasy genres.

I am currently reading You’re Not Fooling Anyone When You Take Your Laptop to a Coffee Shop by Scalzi. I bought it at WebScription for six dollars. It is a compilation of Scalzi’s blog posts on his life as a writer. I am fascinated by how much it resembles my own life. Well… Except for the fact that I don’t get paid when I publish a paper. Maybe I should put together a compilation of posts about my silly work life. Would anyone buy it for six dollars?

I am also reading Halting State by Stross which I bought on Amazon for ten dollars. I haven’t yet gotten into the mood of the novel.

Further reading:

  • According to a Computer Scientist, the iPad could make Computer Science obsolete.
  • While I don’t think academic journals will be available on the Kindle any time soon, I think that has mostly to do with how insane academic publishing is.

Getting serious about online teaching

Earlier this month, Michael Mitzenmacher told us about the record number of students attending his Harvard class online-only. Yesterday, Dick Lipton predicted that online learning will replace campus learning: “I see no reason that On [Online Universities] could not do as good a job as Un [Campus Universities] with this basic goal [Educate Students].” In the comments, Lipton questions the importance of credentials and whether social interactions really need the campus.

I have already written much on the topic but let me reiterate my message:

  • In this new online world, professors are not content providers. They provide structure and motivation. They are role models. And most importantly, by their reputation, professors can provide certification. If someone gets a reference letter from Michael Mitzenmacher or Dick Lipton, I trust they know something about Computer Science, because I trust Michael Mitzenmacher and Dick Lipton. I suspect it is not easy to get these fellows to write fake reference letters because they have a high degree of independence (job security, good money, and so on) and their greatest asset is their reputation.
  • Students are trained to expect classrooms. Many students need structure and constant attention. That is not a good thing! We are effectively training students to be good employees working in large organizations with much structure. Yet, this world made of large and stable organizations has already fallen apart. We urgently need to teach students to learn on their own, using the Web.
  • Yes, there will always be campus classes, the same way there will always be physical libraries with actual books, and newspapers printed on paper.

Further reading:

You know your research is original when…

Many consider Frank Herbert’s Dune the most important work of science-fiction ever written. Consider that Star Wars is just a variation on Dune. Yet, it was rejected by more than twenty publishers before finally being published. It is likely that publishers rejected Dune precisely because it was such a radical departure for the genre.

Of course, being rejected does not mean you are original. It could also mean that you are sloppy or uninteresting. However, there may be valid indications of your originality such as:

  • You have no competitor. Nobody quite does what you do.
  • You cannot be scooped. You read new issues of journals looking for fresh ideas, but without fear that someone has made you irrelevant.

As MacLeod put it: Don’t try to stand out from the crowd; avoid crowds altogether.

Further reading: The secret behind radical innovation and A recipe for interesting Computer Science research papers.

Writing tools to improve your research productivity

Researchers—at least in Computer Science—spend most of their days at a desk typing. Picking the right software for writing is important.

Most of my writing time is spent on LaTeX documents. I have tried typical word processors in the past, but they get in my way. Indeed, by mixing document content and document presentation, Microsoft Word makes it difficult to maintain consistency. Word is meant for short-lived (or throw-away) business documents. It is easy to get started and get 80% of the job done with Word. However, as the document gains complexity, as the number of revisions grows, as the number of collaborators expands, Microsoft Word becomes inadequate.

Oh! I still use OpenOffice or Google Docs to produce quick-and-dirty documents. But for anything that is meant to have lasting value, that is, research, I refuse to fall into the word-processor trap. It causes some friction with colleagues, but it is a price I am willing to pay.

I believe every single graduate student should learn to write without a word processor. And serious science students should learn LaTeX. Even if you do not care for LaTeX, at least explore alternatives to Word such as Scrivener.

In any case, you are unlikely to need more than a text editor to write your prose: Charles Stross, one of the best sci-fi writers alive, wrote many of his novels using a primitive text editor (Vim). If you have never written without Microsoft Word, how do you know that Word is not holding you back?

Right now, I write using a regular text editor (Smultron for MacOS) and the TeX Live 2009 distribution. I save all my documents to a Subversion tree. Using a version control tool such as Subversion makes collaboration easy, and it lets me go back to versions from years ago. It is a good setup.

Programming is also a form of writing. For my experimental work, I program in C++, Java or Python, often using Eclipse. I find it is slightly better for programming than my standard writing setup (using only a text editor). Eclipse has great qualities:

  • It stays out of the way. In particular, you can collaborate with people who are not using Eclipse without any problem.  For example, it will happily let you use handcrafted makefiles to compile your C++ programs.
  • It offers incremental compilation of Java programs. Basically, it compiles as you type.
  • It suggests corrections for many common compilation errors.

Essentially, while Java is still an awful language, Java with Eclipse is almost fun. Eclipse proves that sophisticated software can be helpful to programmers and writers.

Writing is hard and it will always be hard, no matter the tool. But you can at least ease the pain!

See also Physical tools to improve research productivity.

The fundamental properties of computing

Physics works with fundamental properties such as mass, speed, acceleration, energy, and so on. Quantum mechanics has a well-known trade-off between position and momentum: you can know where I am, or how fast I am going, but not both at the same time.

Algorithms (and their implementations) also have fundamental properties. Running time and memory usage are the obvious ones. In practice, there is often a trade-off between memory usage and running time: you can have low memory usage, or a short running time, but not both. Michael Mitzenmacher reminded me this morning of another: correctness. On some difficult problems, you can get low memory usage and a short running time if you accept an approximate solution.

I believe there are other fundamental properties like latency. Consider problems where the volume of the solution and of the input is large: statistics, image processing, finding some subgraph or sublist, text compression, and so on. In such instances, the solution comes out as a stream. You can measure the delay between the input and the output. For example, a program that compresses text by first scanning the whole text might have high latency, even if the overall running time is not large. Similarly, we can give the illusion that a Web browser is faster by beginning the Web page rendering faster, even if the overall running time of the rendering is the same. As another example, I once wrote a paper on computing the running maximum/minimum of an array where latency was an issue.

It would be interesting to come up with a listing of all the fundamental properties of computing.

The end of ‘mass universities’

In the late sixties and seventies, we wanted universities to become more accessible. We founded the Open University, the Université du Québec, and many other universities with accessibility as part of their mandate.

The stated goal was to make degrees more accessible. We succeeded.

Yet, we are now facing an intriguing paradox due to this success. Technology, by making educational content easier than ever to access, is also shaking the very foundation of the University. As an example of this transformation, Michael Nielsen pointed out this morning that you can watch 120 hours of lectures on Physics by Lenny Susskind, for free on YouTube. You are in deep trouble if what you are selling in 2009 are mass-produced lectures. The market price just went through the floor.

Lance Fortnow pointed us to a short essay by Martin Rees about technology and universities. Rees’ point is that technology creates a more level playing field as far as location is concerned. A hundred years ago, airplanes made it possible for Indian Mathematicians to travel to Cambridge where they could be taken seriously. In some sense, airplanes made Indian Mathematicians more globally competitive, though only marginally so. The Web—with repositories such as arXiv—pushes this idea an order of magnitude further. After all, Grigori Perelman won a million dollars and the equivalent of a Nobel prize by posting a few papers on arXiv.

The revolution is all around us, not just in Science. Recently, an unknown writer, Sam Landstrom, posted his novel MetaGame on the Amazon Kindle. No publisher, no ad campaign. Sales rank of his novel? 540. Considering that Amazon reported selling more ebooks than paper books over Christmas, I am sure many authors envy Landstrom’s success. Yet, Landstrom did not need an office in New York City to either write or publish his book. For all I know, he lives in his parents’ basement.

Thankfully, bona fide Universities have some form of monopoly on University degrees. Yet, like Rees, I think that we are coming to the end of the road for mass universities:

Traditional universities will survive insofar as they offer mentoring and personal contact to their students. But it’s less clear that there will be a future for the ‘mass university’ where the students are offered little more than a passive role in lectures (generally of mediocre quality) with minimal feedback.

One thing is clear to me: The value of a lecture in front of 80 students—or the equivalent as a webcasted show—is exactly zero. (From an educational point of view.)

Actual programming with HTML and CSS (without javascript)

I usually stick with academic or research issues, but today, I wanted to have some fun. Geek fun.

While W3C describes Cascading Style Sheets (CSS) as a mechanism for adding style (e.g. fonts, colors, spacing) to Web documents, it is also a bona fide programming language. In fact, it is one of the most widespread declarative languages.

But how powerful is CSS? It is not Turing complete, so there is no hope of programming full applications.  However, CSS may be surprisingly powerful. (Update: It turns out that HTML+CSS is Turing complete!)

Suppose you are given a plain HTML table and you want to add counters and colors to it. Starting with this table:

<table>
<tr><th>City</th><th>Color</th></tr>
<tr><td>Montreal</td><td>Red</td></tr>
<tr><td>Toronto</td><td>Blue</td></tr>
<tr><td>Vancouver</td><td>Yellow</td></tr>
</table>

We want to color rows in alternating colors. We need counters and a way to color even and odd rows differently.

Solution: Add the following lines to your CSS style sheet:

tr { counter-increment: mycounter }  /* every row, including the header, bumps the counter */
table { counter-reset: mycounter -1 }  /* start at -1 so that the first data row displays 1 */
td:first-child:before { content: counter(mycounter) ". " }  /* prefix the first cell with the row number */
tr:nth-child(2n+2) td { background-color: #ccc; }  /* even rows */
tr:nth-child(2n+3) td { background-color: #ddd; }  /* odd rows, skipping the header */

Database Questions for 2010: What’s On My Mind

I started 2009 with an interest in Web 2.0 OLAP and collaborative data processing. The field of collaborative data processing has progressed tremendously. Last year, we got Google Fusion Tables, and data warehousing products are getting more collaborative.

In 2010, my research might focus more on database theory—while maintaining a strong experimental bias. Specifically, I am currently thinking about:

  • Lightweight compression. The goal of lightweight compression is to save CPU cycles—not storage! Of course, the CPU architecture is critical. Thus, you have to worry about instruction-level parallelism. Measuring the quality of the compression by the compression ratio is wrong.
  • Row reordering. Some compression formats, such as run-length encoding, are sensitive to the order of the objects in a database. Reordering them is NP-hard. The efficiency of the heuristics depends on the compression format. I will continue my earlier work on this topic.
  • Concurrency and parallelism. Some believe that multicore CPUs can be used to compress data even more aggressively. This might be misguided. Instead, we must focus on embarrassingly parallel problems. Already, we can scan a large in-memory table using several CPU cores quite fast. In 2010, we should empower our users so that they can explore their data more freely.
  • String hashing. I have argued on this blog that universal hashing of long strings is impossible. While hashing strings is textbook material, our understanding of hashing can be improved further.

Further reading: Search Questions for 2010: What’s On My Mind by Daniel Tunkelang