Some useful regular expressions for programmers

In my blog post, My programming setup, I stressed how important regular expressions are to my programming activities.

Regular expressions can look intimidating and outright ugly. However, they should not be underestimated.

Someone asked for examples of regular expressions that I rely upon. Here are a few.

  1. It is commonly considered a faux pas to include ‘trailing white space’ in code. That is, your lines should end with the line-return control characters and nothing else. In a regular expression, the end of the string (or line) is marked by the ‘$’ symbol, a white-space character can be indicated with ‘\s’, and a sequence of one or more white-space characters is ‘\s+’. Thus if I search for ‘\s+$’, I will locate all offending lines.
  2. It is often best to avoid non-ASCII characters in source code. Indeed, in some cases, there is no standard way to tell the compiler about your character encoding, so non-ASCII might trigger problems. To check all non-ASCII characters, you may do [^\x00-\x7F].
  3. Sometimes you insert too many spaces between a variable and an operator. Multiple spaces are fine at the start of a line, since they can be used for indentation, but other repeated spaces are usually in error. You can check for them with the expression \b\s{2,}. The \b indicates a word boundary.
  4. I use spaces to indent my code, but I always use an even number of spaces (2, 4, 8, etc.). Yet I might get it wrong and insert an odd number of spaces in some places. To detect these cases, I use the expression ^(\s\s)*\s[^\s]. To delete the extra space, I can select it with look-ahead and look-behind expressions such as (?<=^(\s\s)*)\s(?=[^\s]).
  5. I do not want a space after the opening parenthesis nor before the closing parenthesis. I can check for such a case with (\(\s|\s\)). If I want to remove the spaces, I can detect them with a look-behind expression such as (?<=\()\s.
  6. Suppose that I want to identify all instances of a variable: I can search for \bmyname\b. By using word boundaries, I ensure that I do not catch instances of the string inside other function or variable names. Similarly, if I want to select all variables that end with some expression, I can do it with an expression like \b\w*myname\b.
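
As an illustration, here is a minimal Python sketch that applies some of these patterns to made-up source lines. (Beware that some regex engines, including Python’s built-in re module, do not support the variable-length look-behind used in item 4.)

    import re

    # Patterns from the list above, compiled once.
    checks = {
        "trailing white space": re.compile(r"\s+$"),
        "non-ASCII character": re.compile(r"[^\x00-\x7F]"),
        "space inside parentheses": re.compile(r"(\(\s|\s\))"),
        "odd indentation": re.compile(r"^(\s\s)*\s[^\s]"),
    }

    def lint(lines):
        for number, line in enumerate(lines, start=1):
            for message, pattern in checks.items():
                if pattern.search(line.rstrip("\n")):
                    print(f"line {number}: {message}")

    lint(["int x = 1;  \n", 's = "café"\n', "f( x )\n", "   y = 2\n"])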

The great thing with regular expressions is how widely applicable they are.

Many of my examples have to do with code reformatting. Some people wonder why I do not simply use code reformatters. I do use such tools all of the time, but they are not always a good option. If you are going to work with other people who have other preferences regarding code formatting, you do not want to trigger hundreds of formatting changes just to contribute a new function. It is a major faux pas to do so. Hence you often need to keep your reformatting in check.

Scientists and central planning

Overconfident individuals often win by claiming more resources than they could defend (Johnson and Fowler). If nobody knows who is strongest, whoever thinks he is the strongest might win by default. That is, there is no better way to fool others than to first fool yourself.

Accordingly, human beings are often overconfident:

  • Teachers know that lecturing is highly effective (even when it is not).
  • Sex partners know that they are great lovers (even when they are not).
  • Star scientists know that they are more often right than others (even if that’s not true).
  • Economists know how the economy works (even when they don’t).
  • The entrepreneur knows his new business will succeed (even when most fail).

In scholarship, overconfidence is a great asset. I routinely fool myself into thinking that I can make a lasting contribution with my next research projects. By being overconfident, I take risks that I would not take otherwise. If I were more cautious, I would go work in the financial industry, make a lot of money and retire early. Instead, I keep thinking that my next research project might advance Science.

Thus, scientists often develop a God complex: they somehow know that whatever theories they hold must be true. They also believe that they can accurately assess the significance of the work of others. This is not entirely conscious, of course. Deep down, they know that it is impossible to make fair assessments. Yet they fool themselves long enough to believe that their assessments are largely correct (even when they are not).

This is somewhat ironic in that Science’s ultimate purpose is to tame our God complex. Yet the very people who pursue science have gigantic egos. They would never admit that Science is built on trial and error. They want you to believe that Science is driven by superior intellects. How can such pompous people be in charge of moderating our God complex? Because while the scientist might be driven by overconfidence, his tools keep him honest. The scientist is like a high-speed train: he might have a huge engine, but he is (mostly) forced to follow the tracks. A scientist operating entirely based on his inflated ego will eventually be derailed.

The God complex leads us straight to the conventional peer review system: write a paper, send it to a journal where a handful of reviewers will assess it. Can reviewers really tell whether the work is significant? Of course, they can’t! But they believe so, for a time. The scientific way is to try and verify. That is how, ultimately, scientists deal with their God complex. Hence all (apparently) correct research papers should be published. Work that has stood the test of time should form the foundation of a field, not work that has been selected by (false) Gods.

But surely, scientists cannot try all possible theories? They must somehow select the most promising directions. However, once we acknowledge the God complex, some natural strategies emerge:

  1. We should reject central planning. We must reject the authority of the few to decide the worthwhile research directions. We should choose to support many small projects rather than fund one large initiative driven by a handful of individuals. Results matter, not authority.
  2. We should test as many different ideas as we can. Scientists should be encouraged to try many different things.

If you think that these strategies sound sensible, consider that they are almost entirely the opposite of what is being done:

  1. Science is increasingly planned centrally around large government bureaucracies. We want to build scientific heroes who will direct other (lesser) scientists. Large teams focused on a narrow set of problems and directed by experienced researchers are the gold standard. Independent thinkers who try different things “waste resources”.
  2. We encourage specialization. One argument which is often raised is that if a professor tries his hand at a new topic, he will be unable to train new Ph.D.s in this new area, thus limiting our ability to produce more Ph.D.s. It is also frowned upon to try an idea, see it fail, and write up the results as a research report: every project should lead to a publication in a respected venue.

By analogy between Science and capitalism, it is as if the government were funding large corporations at the expense of small shops. It is as if the government funded companies that are highly specific and inflexible at the expense of agile ventures because more stable companies can better train their employees. And, more importantly, it is as if the fate of start-up companies were determined by small secret committees of experts: unless such a committee concludes that the company will work, it cannot launch. It is as if we believed it best to direct the consumer to a few trusted companies, for fear that he might choose to deal with a young unpolished start-up.

In effect, the modern scientist has rejected the free market of ideas. He prefers state-directed communism where a few choose for the masses.

Wikipedia opportunity: There is no entry in Wikipedia for Garage science, where independent researchers do bona fide research with modest means.

Further reading: The perils of filter-then-publish and Improve your impact with abundance-based design.

Update: The original title of this post was “scientists are communists”.

Sentience is indescribable

Arguably, one of the most nagging scientific questions is the nature of sentience. Can we build sentient computers? Is my cat sentient? What does that mean? Will a breakthrough in cognitive science tell us what consciousness, sentience and free will are?

I conjecture that these topics will forever escape us, at least in part.

Near where I live, there is a forest. I can recognize this forest. If you were to drop me asleep in it, I would immediately recognize it. I would recognize the way the trees have grown, the sounds, the smell, the species… But I could never “explain” this forest. That is, I cannot compress down my experience of this forest to a coherent document that I could share. My forest is indescribable: its entropy is too high.

My brain is limited. I can only describe simple structures with a degree of complexity far lower than that of my brain. My brain cannot describe itself. Software appears sentient (though maybe not sapient) when its entropy becomes comparable to that of our own brain. To me, Gmail’s spam filter appears sentient. I know the science behind Gmail’s spam filter. It uses some kind of Bayes classifier. But that’s like saying that the brain is made of neurons.

Of course, software can run other software (e.g., your browser runs JavaScript). Similarly, my brain can predict what my wife will say under some circumstances (mostly when I screw up). But if I were to lay down on paper how I manage to predict my wife’s actions, the result would be illegible: I cannot communicate my understanding to another brain.

So, are my computer and my cat sentient? By this definition: absolutely. They are not sapient, that is, they cannot pass for human beings, but I cannot describe how they work. I can merely describe how some parts of them work and make some limited predictions. For example, I can tell how a CPU works, but my description quickly becomes fuzzy: I am constantly puzzled by how superscalar processors deal with computations. They reorder the instructions in ways that are not intuitive to me. At least as far as my brain is concerned, it is magic! Similarly, a forest is sentient. Earth is sentient.

Of course, this definition of sentience is evolving. A simple engine may be indescribable (magic!) at some point, and then later perfectly describable after some training in mechanics. But as long as I can, eventually, understand how the engine works and communicate my understanding, then I have shown that its complexity is sufficiently below that of our brains.

If you accept my point of view, then it has some consequences for morality. Some say that sentient beings deserve respect. For example, you should not own sentient beings. Yet if sentience is nothing special, but merely a computing system with entropy approaching our own, then why should they deserve special consideration? Perhaps we just want to cling to the belief that sentience is somehow special.

Improve your impact with abundance-based design

People design all the time: new cars, new software, new houses. All design is guided by constraints (cost, time, materials, space) and by objectives (elegance, quality). Constraints are limitations: you only have so much money, so many days… whereas objectives are measures that you seek to either maximize or minimize. In practice, either the constraints or the objectives may dominate. You are either worried about limited resources, or you seek to maximize the quality of your result.

Our ancestors were probably often forced into scarcity-based design. When your very survival is in question, you build whatever shelter you can in the hours you have left before nightfall. We are probably wired for good scarcity-based design as it is a survival trait.

Any monkey can live in scarcity. However, abundance-based design is crucial if you want to maximize your impact.

Facebook engineers do abundance-based design. They are mainly worried about improving Facebook and pursuing objectives such as usability, but much less worried about time or disk space. Similarly, when I build a model sailboat, costs and time are nearly irrelevant: I mostly care that my boat be pretty and that it handle well. As a researcher, most of my research papers are the result of abundance-based design. It does not matter how long I work on the research projects, as long as the result has impact. Similarly, my blog is the result of abundance-based design. Nobody is forcing me to write on a regular schedule. And I have no set limit on the time I spend on my blog.

Many people choose to simulate scarcity-based design, maybe because it comes with an adrenaline rush. In fact, the adrenaline rush is a good indication that you are in scarcity mode. You will often hear scarcity-based designers say that they are running out of time, money or space. They may spend much time planning or worrying about costs and deadlines. There are many examples of artificial scarcity-based design:

  • One of the great fallacies of software engineering is that what matters in the software industry is how long it takes and how much it costs. But anyone who has been in the software industry long enough knows that the real problem is that most software is bad. Some of it is atrocious. For example, Apple iTunes is a disgrace. I don’t care whether the iTunes team finished on time and within budget. Their software is crap. They failed as far as I am concerned.
  • Nobody cares how long it took you to write your novel or research paper. Yet people sign deals with publishers with fixed deadlines and others choose to publish in conferences with fixed deadlines. They create external pressure, on purpose.

Frankly, if you are a designer such as an artist, a fiction writer, a scientist or a scholar, you should have a feeling of urgency, not worry. A single strategy may suffice to put you in abundance mode:

  • Reduce the quantity. Apple is well known for having few products. Despite having billions of dollars, they focus on few projects. And their new products often have fewer features than the competition. By focusing your attention, you ensure abundance. Don’t start more projects than you can execute with ease.

Further reading: Publishing for Impact by John Regehr and The merits of chasing many rabbits at the same time by Alain Désilets.

Is science more art or industry?

picture by bdesham

In my previous post, I argued that people who pursue double-blind peer review have an idealized “LEGO block” view of scientific research. Research papers are “pure” units of knowledge and who wrote them is irrelevant.

Let us take this LEGO block view to its ultimate conclusion.

If science is pure industry, producing standardized elements (called research papers), why should papers be signed as if they were pieces of art? The signature is obviously irrelevant. Nobody cares who made a given LEGO block. Thus, I propose we omit names from research papers. It should not change anything, and it will be fairer.

Indeed, why not have anonymous papers all the way? Journals could publish articles without ever telling us who they are from. We would not know, for example, which papers were written by Einstein or Turing. How is that relevant? How does it help us to appreciate a given paper to know it was written by Turing?

What would we do for conferences? Because papers are standard units, people could attend conferences and be assigned a paper, any paper, to present. Presenting your own work is a bit too egotistical anyhow.

Of course, for recruiting or promotion purposes, we would need to be able to map research papers to individuals. But, because papers are standard units, all you care about is the number of papers and related statistics. Thus, an academic c.v. would not list research papers, but instead provide a key that could be used to retrieve productivity statistics.

Of course, this is not, even at a first approximation, how science works. Science is more art than industry. That is why we put our names on research papers. It does matter that it is Turing that wrote a given paper. It helps us understand the paper better to know its author, its date and its context. When I receive a paper to review, I try to see how the authors work, what their biases are.

Research papers present a view of the world. But like the shadows in Plato’s cave, this view is fundamentally incomplete. If a paper reports the results of experiments the authors conducted, the paper is not these experiments: it is only a view on these experiments. It is necessarily a biased view. Do you know what the biases of these particular authors are?

Let us be candid here. When reviewing research papers, there is no such thing as objectivity. Some papers are interesting to the reviewer, some aren’t. What makes a paper interesting has to do with whether the world view presented is compatible with the reviewer’s world view. And because different individuals have (or should have) different world views, it does matter who wrote the paper even if we omit names. It helps me to find your paper interesting if I can put myself in your shoes, get to know who you are. An anonymous paper is far more likely to be boring to me, because it is hard to have empathy for the authors.

Some days, we all wish it did not matter who we are. Can’t people just look at our work on its own? You can get your wish by becoming a bureaucrat or a factory worker. Science is for people who want to see their name in print, people who want to build their reputation and cater to their inflated ego. In short, good science is interesting.

Ten things Computer Science tells us about bureaucrats

Originally, the term computer applied to human beings. These days, it is increasingly difficult to reliably distinguish machines from human beings: we require ever more challenging CAPTCHAs.

Machines are getting so good that I now prefer dealing with computers rather than bureaucrats. I much prefer to pay my taxes electronically, for example. Bureaucrats are rarely updated, and they tend to require constant attention, like aging servers.

In any case, a bureaucracy is certainly an information processing “machine”. If each bureaucrat is a computer, then the bureaucracy is a computer network. What does Computer Science tell us about bureaucrats?

  1. Bureaucracies are subject to the halting problem. That is, when facing a new problem, it is impossible to know whether the bureaucracy will ever find a solution. Have you ever wondered when the meeting would end? It may never end.
  2. Brewer’s theorem tells us that you cannot have consistency, availability and partition tolerance in a bureaucracy. For example, accounting departments freeze everything once a year. This unavailability is required to achieve yearly consistency.
  3. Parallel computing is hard. You may think that splitting the work between ten bureaucrats would make it go ten times faster, but you are lucky if it goes faster at all.
  4. One of the cheapest ways to improve the speed of a bureaucracy is caching. Keep track of what worked in the past. Keep your old forms and modify them instead of starting from scratch. (A minimal code sketch follows this list.)
  5. Pipelining is another great trick to improve performance. Instead of having bureaucrats finish the entire processing before they pass on the result, have them pass on their completed work as they finish it. If you have a long chain of bureaucrats, you can drastically speed up the processing.
  6. Code refactoring often fails to improve efficiency. Correspondingly, shuffling a bureaucracy is just for show: it often fails to improve productivity.
  7. Bureaucratic processes spend 80% of their time with 20% of the bureaucrats. Optimize them out.
  8. Know your data structures: a good organigram should be a balanced tree.
  9. When an exception occurs, it goes up the ranks until a manager can handle it. If the CEO cannot handle it, then the whole organization will crash.
  10. The computational complexity is often determined by looking at the loops. That is where your code will spend most of its time. In a bureaucracy, most of the work is repetitive.
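
For the programmers in the room, item 4 is easy to make concrete. A minimal Python sketch of caching (memoization), with made-up form names:

    import functools

    # Remember the answers to requests already processed,
    # instead of redoing the paperwork from scratch.
    @functools.lru_cache(maxsize=None)
    def process_request(form):
        print(f"processing {form} from scratch...")
        return f"approved: {form}"

    process_request("form A")  # slow path: computed, then cached
    process_request("form A")  # fast path: answered from the cache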

Update: Neal Lathia commented that neither bureaucrats nor computers understand humor.

Update: “This is a fairly well-known model, and no it isn’t computer science that is at the root of what you are noticing. It is early operations research. Taylorism in fact. There was a conscious effort in the 20s and 30s to bring Taylorist-style assembly line/operations research thinking into white collar work, starting with organizing pools of typists, secretaries and other office workers the same way banks of machine tools were organized into flow shops and assembly lines. The exact same Taylorist time-and-motion study tools were applied (in fact, in the 30s this was so popular that women’s magazines carried articles about time-and-motion in the kitchen. Example: puzzles like “what’s the fastest way to toast 3 slices of bread on a pan that can hold 2 and toast 1 side at a time?”) Computer science itself was initially strongly influenced by shopfloor OR… that’s where metaphors like queues come from after all.” (Venkatesh Rao)

Know the biases of your operating system

Douglas Rushkoff wrote in Life Inc. that our society is nothing more than an operating system upon which we (as software) live:

The landscape on which we are living, “the operating system on which we are now running our social software”, was invented by people, sold to us as a better way of life, supported by myths, and ultimately allowed to develop into a self-sustaining reality.

In turn, operating systems are designed and maintained by engineers who make choices and have biases. He makes us realize that corporations, these virtual beings which live forever and are granted full privileges (including free speech), are not natural but are bona fide inventions. He also stresses that central currency, that is, the concept that the state must have a monopoly on the currency, is also an invention: why is it illegal to switch to an alternative currency in most countries?

We fail to see these things, or rather, we take them for granted because they are our operating system. Someone used to Microsoft Windows takes for granted that a desktop computer must behave like Microsoft Windows: they cannot suffer MacOS or Linux, at least initially, because it feels instinctively wrong. Anyone, like myself, who uses non-Microsoft operating systems in a predominantly Microsoft organization is constantly exposing hidden assumptions. “No, my document was not written using Microsoft Word.”

Science has an operating system as well. One of its building blocks is traditional peer review: you submit a research paper to an editor who picks a few respected colleagues who, in turn, advise him on whether your work is valid or not. By convention, any work which did not undergo this process is suspect. In Three myths about peer review, Michael Nielsen reminded us that traditional peer review is not a long tradition, and is not how correctness is assessed in science. Grigori Perelman, by choosing to forgo traditional peer review while publishing some of the most important mathematical work of our generation, could not have made Nielsen’s point stronger. Similarly, we believe that serious academics must publish books through a reputable publisher: self-publishing a book would be a sure sign that you are a crank. Years ago, scholars who had blogs were clowns (though this has changed). We also value greatly the training of new Ph.D. students, even when there is no evidence that the job market needs more doctors. We value greatly large research grants, even when they take away great researchers from what they like best (doing research) and turn them into what they hate doing (managing research). But nobody is willing to question this system because the alternative is unthinkable. “You mean that I could use something besides Microsoft Windows?”

In my previous post, I challenged public education. Some people even went so far as to admit that my post felt wrong. I suspect that this feeling is not unlike the feeling one gets when switching from Windows to Linux. “Where is Internet Explorer?”

Several people cannot imagine that you can become smart without a formal education which includes at least a high school diploma. It is not that the counter-examples are missing (there are plenty: Bobby Fischer, Walt Disney, James Bach and Richard Branson). It is simply hard to imagine that you could do away with brick-and-mortar schools and still have scholarship and intelligence. Similarly, we cannot imagine a world without corporations or without central currency, or science without formal peer review.

Challenging preconceived notions is difficult because your feelings will betray you. Radically new ideas feel wrong. The cure is to try to remember what it felt like when you were first exposed to these ideas. On this note, Andre Vellino pointed me to Disciplined Minds, a book so controversial that it got its author fired! It reminded me of my feelings as a student about exams, grades and teachers:

  • Exams and grades appear neutral: on the face of it, they are merit-based challenges. In fact, they are really tests of conformity. To get good grades, you must organize much of your life around what others expect you to do. I cannot think of a good reason why most people would care about the integral of x³ cos(x). Why do we require such technical knowledge of so many people? The reason is simple: if you can set aside all other interests to learn calculus just because you are told to do so, then you are good at learning what you are told to learn. If you refuse to hand in an assignment because you think it is stupid, you will be punished. It does not matter if you use the free time to be even more productive on some valid scholarship.
  • Teachers appear unpolitical at first glance. They teach commonly accepted facts to students. However, teachers are political because they never challenge the curriculum, and when they do, they are frequently fired. As a kid, I refused to learn my multiplication tables. I was repeatedly chastised for failing to memorize them: instead, I would design algorithms to quickly deduce the correct answer without rote memorization (one such algorithm is sketched below). This was called cheating, and my teachers would wait for the small pause and then interrupt me: “you are cheating again, you have to memorize”. I still have not memorized my multiplication tables. Why is it that no teacher ever opposed the requirement that we memorize multiplication tables? Because their job involves teaching obedience.
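
Such shortcuts are genuine algorithms. As a sketch (not necessarily one of the tricks I used as a kid), here is multiplication by repeated doubling and halving, which requires no memorized tables:

    # Multiply two non-negative integers by doubling and halving
    # (the so-called Russian peasant method): no tables required.
    def multiply(a, b):
        total = 0
        while b > 0:
            if b % 2 == 1:   # when b is odd, add the current a
                total += a
            a += a           # double a
            b //= 2          # halve b
        return total

    assert multiply(7, 8) == 56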

So, the same way corporations and central currencies are not neutral, public education is not neutral. Kids are naturally curious. If you leave them alone, they will learn eagerly. Alas, they will also refuse to learn what you are telling them to learn. This is precisely what schools are meant to break.

Public education historically helped class mobility. Publicly funded scholars have also greatly contributed to our advancement. However, as the world is changing through increased automation and globalization, we may need to drastically shift gears. Stephen Downes answered my previous post with a pointer to his essay Five key questions. In this essay, Downes offers a foundational principle for a renewed public education:

It represents a change of outlook from one where education is an essential service that must be provided to all persons, to one where the role of the public provider is overwhelmingly one of support and recognition for an individual’s own educational attainment. It represents an end to a centrally-defined determination of how an education can be obtained, to one that offers choices, resources and assessment.

Downes challenges conformity as a core value for education. Quite the opposite: he calls into question the idea that education should be “managed”. I believe he would agree that one of the great tragedies of public education is the centrally mandated curriculum. This was ideal preparation for the slow-moving corporations of the sixties and seventies. In 2011, why punish a kid who decides to spend five years building a robot?

Go ask your kid to name a planet. If his answer is Jupiter, Mars or Earth, be worried. In the twenty-first century, we need kids who answer Eris or Makemake.

Further reading: Brian Martin, Review of Jeff Schmidt’s Disciplined Minds: A Critical Look at Salaried Professionals and the Soul-Battering System that Shapes their Lives, Radical Teacher, No. 62, 2001, pp. 40-43.

Taking scientific publishing to the next level

Scientific publishing is wasteful. We spend much time perfecting irrelevant papers to get them through peer review. Meanwhile, important papers—that thousands of researchers will have to study—remain filled with errors or suffer from a suboptimal presentation. Surely, you have stumbled on an important paper and thought to yourself: this paper could use a couple of examples. Or maybe the important results are buried deep into irrelevant material because the authors did not know what was really significant when they wrote the paper. We patch the system by writing lecture notes and even textbooks, which are themselves obsolete soon after their publication.

We can do better:

  • Research papers should be subject to bug reports and feature requests. We need a bugzilla for research papers. It would cost nearly nothing, but it would dramatically improve the important research papers. Moreover, young researchers could build up a reputation: finding and reporting bugs in the literature is a sign of leadership.
  • Research papers need versioning: the authors should revise their most important work, to fix bugs and improve the presentation. Important research papers should be perfected as much as possible. (Some open archives such as arXiv already have this function.)
  • When the authors are unwilling or unable to improve their important papers, then someone should create a derivative paper. For example, a graduate student could take a classical research paper and rewrite it to fix bugs and improve the presentation. As a community, researchers should promote licensing which specifically allows this type of derivative work. Researchers should also publish documents that can be easily edited (LaTeX source code or Microsoft Word documents). Many small fixes do not warrant yet another (possibly irrelevant) research paper. We need to be able to go back and patch older work for everyone’s benefit.

Yes, I am effectively saying that we should treat research papers like Open Source software.

Further reading: The Journal Manifesto 2.0 by Bill Gasarch and my Simplified Open Publishing Manifesto.

Credit: This blog post was motivated by an email exchange with Daniel Gayo Avello.

Update: Michael Nielsen sent me a pointer to his essay Micropublication and open source research.

Turning vanity publishing on its head

It has never been easier to self-publish a book:

  • Amazon has CreateSpace which offers a print-on-demand service and an ISBN if you want one. Self-publishing on the Amazon Kindle store could not be easier.
  • Apple allows anyone to self-publish ebooks through its iTunes Connect service. Unfortunately, you need a Mac to run their software, and you must have an ISBN number. Thankfully, you can buy ISBN numbers online through vendors like Bowker.
  • Barnes and Noble makes it very easy to publish ebooks through its PubIt! online service.
  • Borders offers a service of its own called Get Published.
  • Lulu offers print-on-demand and makes available both your paper and electronic books through Amazon.

Typically, self-publishing is associated with vanity publishing. Wikipedia defines vanity publishing as publishing at the author’s expense. So, is self-publishing really vanity publishing? Consider the Amazon Kindle top-10 best-sellers. Three of these books are self-published books from Amanda Hocking. She can sell 10,000 books a week.

I believe that we are seeing a reversal: people who want prestige, but not necessarily readership or sales, want a bona fide publisher. Meanwhile, individuals who want to make money increasingly self-publish. In this respect, self-publishing is attractive: in many cases, you get to keep close to 70% of the sales as royalties.

What about academia? Some people believe that researchers get paid to publish articles. The opposite is true. Often, journals charge the authors. Thus, by definition, scientific publishing is vanity publishing. Some professors make money by publishing books, typically by publishing a popular textbook, but most do not.

Demarchy and probabilistic algorithms

Demarchy is a political system built using randomness. It has been used to designate political leaders in Ancient Greece and in France (during the French Revolution). In many countries, jury selection is random. On Slashdot, the comment moderation system relies on randomly selected members: as you use the system, you increase your (digital) karma, which increases the probability that you will become a moderator. Demarchy has been depicted by several science-fiction authors including Arthur C. Clarke, Kim Stanley Robinson, Ken MacLeod and Alastair Reynolds.

Democracy is deterministic: we send the representative with the most votes. Demarchy is probabilistic: we need a source of random numbers to operate it. Unfortunately, probabilities are less intuitive than simple statistical measures like the most frequent item. It is also much more difficult to implement demarchy, at least without a computer. Meanwhile, most software uses probabilistic algorithms:

  • Databases use hash tables to recover values in expected constant time;
  • Cryptography is almost entirely about probabilities;
  • Most of Machine Learning is based on probabilistic models.

In fact, probabilistic algorithms often work better than their deterministic counterparts.
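
To illustrate how simple the machinery can be, here is a minimal Python sketch of Slashdot-style selection: a panel drawn at random, with probabilities weighted by karma. The names and karma values are made up.

    import random

    # Hypothetical participants and their (digital) karma.
    participants = {"alice": 12, "bob": 3, "carol": 7, "dave": 1}

    def draw_panel(karma, size):
        """Draw a panel of distinct members, weighted by karma."""
        names, weights = list(karma), list(karma.values())
        panel = set()
        while len(panel) < size:
            panel.add(random.choices(names, weights=weights, k=1)[0])
        return panel

    print(draw_panel(participants, 2))  # e.g. {'alice', 'carol'}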

Some common objections to demarchy:

  • Not everyone is equally qualified to run a country or a city. My counter-argument: Demarchy does not imply that everyone has an equal chance of being among the leaders. Just like in a jury selection, some people are disqualified. Moreover, just like on Slashdot, there is no reason to believe that everyone would get an equal chance. Active participants in the political life might receive higher “digital karma”. Of course, from time to time, incompetent leaders would be selected, by random chance. But we also get incompetent leaders in our democratic system.
  • There is too much to learn: the average Joe would be unable to cope. My counter-argument: Demarchy does not imply that a dictator for life is nominated, or that the leaders can make up all the rules. In the justice system, jury duty covers a short period of time and a narrow scope (one case), is closely supervised by an arbitrator (the judge), and typically involves several nominees (a dozen in Canada). And if qualifications are what matters, then why don’t we elect leaders based on their objective results on some tests? The truth is that many great leaders did not do so well on tests. Most notably, Churchill was terrible at Mathematics. Moreover, experts such as professional politicians are often overrated: it has been claimed that picking stocks at random is better than hiring an expert.
  • Demarchy nominees would be less moral than democratically elected leaders. My counter-argument: I think the opposite would happen. Democratic leaders often feel entitled to their special status. After all, they worked hard for it. Studies show that white-collar criminals are not less moral than the rest of us; they just think that the rules don’t apply to them because they are special. Politics often attracts highly ambitious (and highly entitled) people. Meanwhile, demarchy leaders would be under no illusion as to their status. Most people would see the nomination as a duty to fulfill. It is much easier for lobbyists to target professional politicians than short-lived demarchy nominees.

Further reading: Brian Martin’s writings on demarchy and democracy and How to choose? by Michael Schulson.

Our institutions are limited by pre-digital technology

Many of our institutions are limited by pre-digital technology: (1) it is difficult to constantly re-edit a paper book; (2) without computers, global trade requires perennial currencies; (3) without information technology, any political system more fluid than representative democracy cannot scale up to millions of individuals. These embedded limitations still constrain us, for now.

  • We teach kids arithmetic and calculus, but systematically fail to teach them about probabilities. We are training them to distinguish truth from falsehoods, when most things are neither true nor false. Meanwhile, sites like Stack Overflow and Math Overflow expose these same kids to rigorous debates, much like those I imagine the Ancient Greeks having. We are exposing the inner workings of scholarship like never before. Instead of receiving knowledge from carefully crafted books and lectures, kids have to be trained to seek out truth on their own. The new generation will expect textbooks to be editable, like Wikipedia. You might be 16, but you can still correct a professor. But the 12-year-old next door can correct you too. Are we still going to have static books, idealized in some “final version”? I think that books and journal articles will have to become dynamic, or they will grow irrelevant.
  • Why do we need central banks and currencies? Because they, alone, can ensure determinism and stability? Why can’t we have several competing currencies? In any case, how wealthy is Mark Zuckerberg? Financial determinism is an illusion. There may be a picture of the Queen on my dollar bills, but the authority of the Monarch does not protect the value of this currency. If not for their fear of upsetting local governments, VISA and Mastercard could create a new currency in an instant. If not for government regulations, Paypal would be competing with central banks. Suppose that I help a game company. Maybe they could pay me with their in-game currency. Many people worldwide would be willing to trade this currency against other goods and services. At no point do I need to print money! We could even automate the process. I get some in-game currency and I want to buy food; can a computer arrange trades so that I get bags of rice delivered to my home? Of course, it can. Obviously, governments would start regulating, because these sorts of trades effectively cut the government out. But I think it would be very difficult to outlaw or regulate these trades in a context where the financial markets are driven by algorithms.
  • We expect political leaders to represent the interest of an entire population. Have you ever stopped to consider how insane this expectation is? Would you trust a jury made of a single person? We have learned from machine learning that the best results are typically achieved with Ensemble Methods. What we need are sets of leaders, automatically selected and weighted (see the sketch after the table below). We don’t want the best leader, we want the best ensemble of leaders. And the best way to avoid biases is to change these leaders frequently. We almost have the technology to do it in a scalable manner. Electronic voting is only the first step toward a new, more fluid, form of politics. If it sounds crazy, consider that people spend more and more time online, where politics is very different than in our brick-and-mortar world. The models emerging online will be implemented in offline politics. That is where the political innovation will occur.
            Currency                     Mathematics               Politics    Knowledge
Primitive   None                         One, Two or Many          Tribe       Stories
Moderate    Local trade                  Arithmetic and geometry   Monarchy    Manuscripts
Advanced    Central currencies           Calculus                  Democracy   Books, Authoritative Science
Upcoming    Distributed and electronic   Bayesian                  Demarchy    Open and Participative Scholarship
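
The “ensemble of leaders” idea above is also easy to sketch in code. A minimal, hypothetical example of a weighted majority vote (the names, weights and votes are made up):

    # Each leader casts a vote; votes are weighted, as in ensemble methods.
    weights = {"alice": 0.5, "bob": 0.8, "carol": 0.7}

    def ensemble_decision(votes, weights):
        """votes maps each leader to True (for) or False (against)."""
        score = sum(weights[name] * (1 if vote else -1)
                    for name, vote in votes.items())
        return score > 0

    print(ensemble_decision({"alice": True, "bob": False, "carol": True}, weights))
    # True: alice and carol (combined weight 1.2) outweigh bob (0.8)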

So, you want to be a mad scientist?

Exceptional scientists are often a bit crazy:

  • Kurt Gödel was paranoid: fearing poison, he ate only food prepared by his wife, and he starved to death when she was hospitalized.
  • John Nash suffered from paranoid schizophrenia.
  • Paul Erdős was a homeless itinerant most of his life.
  • Henry Cavendish was so shy that he only communicated with his servants by writing notes.
  • Theodore Kaczynski (the Unabomber) became assistant professor at the University of California at Berkeley at the age of 25, before going to live in a cabin without electricity or running water.
  • Nikola Tesla was obsessive compulsive and mysophobic.

Further reading: Scientists and their emotions.

Source: tvtropes via Daniel Lowd.

Three of my all-time most popular blog posts

  • Emotions killing your intellectual productivity: We all have to deal with setbacks. And even when things go our way, we can still remain frustrated. I offer pointers on how to remain productive despite your emotional state.
  • Turn your weaknesses into strengths: We all have weaknesses. Maybe you are unemployed. Or maybe you failed at getting a research grant. Maybe you live in a remote or poor area. I think that, within reason, many of your weaknesses can actually play in your favor if you adapt your strategy.
  • How reliable is science?: I got a lot of heat for this blog post. Basically, I believe that the business of science is unreliable. And I am not alone. The Nobel laureate Harry Kroto wrote: “The peer-review system is the most ludicrous system ever devised. It is useless and does not make sense (…)”. I couldn’t agree more as my post The hard truth about research grants makes clear.

Remarkable scientists without a Wikipedia page

I was surprised today to learn that Michael Ley’s Wikipedia page had been deleted (because it failed to indicate the significance of the subject). I have yet to meet anyone in Computer Science or Information Technology who does not know about the DBLP Computer Science Bibliography. Michael has received numerous prestigious awards for his work. He is a remarkable pioneer.

But there are other remarkable people without a Wikipedia page. Patrick O’Neil is another good example. Consider the citations that some of his papers received (according to Google Scholar):

  • The dangers of replication and a solution: cited 1036 times;
  • The LRU-K page replacement algorithm for database disk buffering: cited 515 times;
  • Improved query performance with variant indexes: cited 421 times;
  • ORDPATHs: insert-friendly XML node labels: cited 273 times.

Chances are that if you have ever used a database engine at all, it implemented an algorithm or a technique related to O’Neil’s work.

How many other remarkable scientists don’t have a Wikipedia page?

Update: Thanks to Ragib Hasan and David Eppstein, these two computer scientists now have Wikipedia pages (see Ley and O’Neil).

Why you may not like your job, even though everyone envies you

In a provocative post, Matt Welsh, a successful tenured professor at Harvard, explained why he left his academic job for an industry position. It created a serious malaise: his department chair (Michael Mitzenmacher) wrote a counterpoint answering the improbable question: “why I’m staying at Harvard?” To my knowledge, it was the first time a department chair from a prestigious university answered such a question publicly. Michael even went as far as arguing that, yes, indeed, he could get a job elsewhere. These questions are crazy if we consider that for every job advertised at Harvard, there are probably hundreds of highly qualified applicants.

But let me get back to Matt’s reasons for leaving a comfortable and prestigious job at Harvard:

(…) all of that extra work only takes away from time spent building systems, which is what I really want to be doing. (…) At Google, I have a much more direct route from idea to execution to impact. I can just sit down and write the code and deploy the system, on more machines than I will ever have access to at a university. I personally find this far more satisfying than the elaborate academic process.

In other words, Matt is happier when his work is more immediately useful. But where does the malaise about his decision come from? After all, he will probably make as much or even more money at Google. Matt is not alone, by the way: Matthew Crawford, a Ph.D. in philosophy, left a high-paying job in an American think tank for a job repairing motorbikes. His book Shop Class as Soulcraft tells his story.

I think that Matt’s decision might be hard to understand; at least, his department chair felt the need to explain it to us. The reason is that Matt is calling into question the very core values of our society. These core values were explored by Veblen in his unconventional book The Theory of the Leisure Class. He argued that we are not driven by utility, but rather by social status. In fact, our society pushes us to seek high-prestige jobs, rather than useful and productive jobs. In effect, a job doing research in Computer Science is more prestigious than an industry job building real systems, on the mere account that it is less immediately useful. Here are some other examples:

  • The electrician who comes and wires your house has a less prestigious job than the electrical engineer who manages vague projects within a large organization.
  • The programmer who outputs useful software has a less prestigious job than the software engineer who runs software projects producing software that nobody will ever use.
  • The scientist who tinkers in his laboratory has a less prestigious job than the scientist who spends most of his time applying for research grants.

Note how money is not always immediately relevant. While manual labor normally pays less, pay is almost beside the point. Indeed, plumbers make more than software developers in some parts of the world (like Montreal), even though software jobs are usually considered more desirable.

There are at least three problems with this social-status system:

  • Nature is the best teacher. Working on real problems makes you smart. The German philosopher Heidegger famously made this point with a hammer. To paraphrase him, it is not by staring at a hammer that we learn about hammers. Similarly, scientists who do nothing but abstract work in the context of funding applications are missing out. The best scientists work in the laboratory, in the field; they tinker.
  • By removing ourselves from the world, we risk becoming alienated. We become strangers to the world around us. Instead, we construct an incoherent virtual reality which often has much to do with Soviet-era industrialism. We must constantly remain vague because truth has become subjective. Whereas the hammer hits, the software crashes and the experiment fails… projects are always successful, marketing is always right and truth is arrived at by consensus. Yet, we know deep down that this virtual reality is unreal and we remain uneasy, trapped between reality and virtuality. The perfect example is the financial markets, which create abstract products with agreed-upon values. As long as everyone plays along, the system works. Nobody must ever say that the emperor is naked. Everyone must accept the lies. Everything becomes gray.
  • Human beings like to make their own stuff. We value considerably more what we did ourselves. You may be able to buy computers for $200, but nothing will ever replace the computer you made yourself from scratch. It may be more economical to have some Indian programmers build your in-house software, but the satisfaction of building your own software is far greater than what you get by merely funding it. Repairing your own house is a lot more satisfying than hiring handymen.

To summarize: trading practical work for high-level positions is prestigious, but it may make you dumber, alienated and unhappy. Back when I was a graduate student, we used to joke about the accident. The accident is what happens to successful professors: they suddenly become uninteresting, pompous, and… frankly… a tad stupid.

Thankfully, there is hope. The current financial crisis, mostly because it couldn’t happen according to most economists, was a wake-up call. The abstract thinkers may not be so reliable after all! The millions of college graduates who are underemployed in wealthy countries all around the globe have unanswered questions. Weren’t these high-level abstract college degrees supposed to pay for themselves?

How do we fix this broken caste system and bring back a healthier relationship with work? Alas, we cannot all become plumbers and electricians. But it seems to me that more and more people are realizing that the current system, with its neat white collar jobs and rising inequalities, could be improved upon drastically. The Do it yourself (or Do it with others) wave has been a revelation for me. Yes: Chinese factories can build digital thermometers much cheaper than I can. But making your own digital thermometer is far more satisfying. Saving money by abstracting out reality is not a good deal. And of course, building real systems is not the same as finding money for your students to do it for you.

Further reading: Working long hours is stupid and Formal definitions are less useful than you think.

Public funding for science?

Terence Kealey has been arguing against public funding of science. Is it efficient to fund science with government dollars? He argues that when science is mostly funded by large government agencies, other funding sources are effectively crowded out. He has two good historical examples. Firstly, while France massively invested in research and academic institutions in the 17th and 18th centuries, it was the United Kingdom, and not France, that gave birth to the industrial revolution and the accompanying scientific surge. Secondly, the United States was leading the world in technological innovation starting in the 19th century, whereas it had a comparatively underdeveloped academic system and no public research funding.

In short, whereas there is a correlation between wealth and scientific output, there is no evidence that public science funding generates economic growth. Moreover, government funding results in a concentration of power in the hands of few politicians. Trusting politicians with almost all of the research funding is a tad insane. It is even crazier to think that politicians have science in mind when allocating funding.

Kealey argues that for every dollar invested by the government, more than a dollar is withdrawn from research by private investors. While I don’t know whether this is true, I do know that I have no idea how I would go about asking for private funding, outside government programs, for my research. How do you go about it? Do you post a video on, say, Kickstarter?

Note: I am a research grant recipient. The system has generally been good to me.

Can Science be wrong? You bet!

A common answer to my post on the reliability of science was that fraud is marginal and that, ultimately, science is self-correcting. That is true on one condition: that the science in question is bona fide science. Otherwise, I disagree that institutional science is self-correcting. It is self-correcting about as much as human beings are rational. That is, not often. A lot of what passes for science is actually cargo cult science. What looks like rigorous science may not be, no matter what the experts tell you. Don’t fool yourself: science is not the process of getting published in prestigious journals or a tool to get a tenured job. Richard Feynman defined science as the belief in the ignorance of experts.

Institutional science can be wrong or not even wrong for decades without any remorse:

  • Economists failed to predict or explain the last financial crisis. Yet they can’t call their models into question. Philip Mirowski explains why: “The range in which dissent happens is so narrow. (…) The field got rid of methodological self-criticism.”
  • A large fraction of AI researchers have convinced themselves that intelligence must emerge from Prolog-like reasoning engines. This gave us twenty years of predictions that the future was in expert systems, and the last ten years spent predicting the rise of the Semantic Web. This ever-growing community of AI researchers is oblivious to its own failure to produce any useful result.
  • Like Fred Brooks, I’m amazed that in 2010, the waterfall method is taught in software engineering schools as the reference model. There is no evidence that it is beneficial and, in fact, much evidence that it is hurtful. That is, students would be better off learning nothing rather than learning to use the waterfall method. Yet, entire Ph.D. theses are still built on the assumption that the waterfall method is sound. Accordingly, criticizing the waterfall method on campus is a risky business.
  • The dominant paradigm of modern Theoretical Physics is String theory, which is not even a scientific theory.

We should not trust that self-correction will happen. Instead, biases are often self-reinforcing. Rather, we must ask how self-correction can happen. I think that all science must be verified by independently designed and reproduced experiments. For example, it is insufficient to verify the speed of light with one reproducible experiment. It must be possible for different researchers to come up independently with different experiments, which are all reproduced independently several times. And if everyone is working from the same data, the limitations of the data may never be revealed. And if there is no experiment, you are doing Mathematics or art, not science.

Peer review does not lead to self-correction. Peer review increases quality, but it can also reinforce biases. In Information Retrieval, we often talk about the trade-off between precision and recall. Peer review improves precision, but degrades recall. If your primary goal is to please your peers, you won’t be tempted to point out the flaws in their research!
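
As a rough illustration with made-up numbers: a harsh review process lets few bad papers through (high precision) but rejects many deserving ones (low recall).

    # precision = fraction of accepted papers that deserved acceptance
    # recall    = fraction of deserving papers that were accepted
    def precision(true_positives, false_positives):
        return true_positives / (true_positives + false_positives)

    def recall(true_positives, false_negatives):
        return true_positives / (true_positives + false_negatives)

    print(precision(true_positives=40, false_positives=5))   # ~0.89
    print(recall(true_positives=40, false_negatives=60))     # 0.40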

However, I am optimistic for the future. The rise of Open Scholarship will allow outsiders to participate in the research process and keep it more honest.

How reliable is science?

It is not difficult to find instances of fraud in science: consider the cases of Ranjit Chandra, Woo-suk Hwang and Jan Hendrik Schön.

How did these people fare after being caught?

  • Ranjit Chandra still holds the Order of Canada, as far as I can tell. According to Scopus, his 272 research papers were cited over 3000 times. As for his university? Let me quote Wikipedia: University officials claimed that the university was unable to make a case for research fraud because the raw data on which a proper evaluation could be made had gone missing. Because the accusation was that the data did not exist, this was a puzzling rationale.
  • According to Scopus, Woo-suk Hwang has been cited over 2000 times. Despite having faked research results and having committed major ethics violations, he has kept his job and… he is still publishing.
  • Despite all the retracted papers, Jan Hendrik Schön still has 1,200 citations according to Scopus. He lost his research job, but found an engineering position in Germany.

Conclusion: Scientific fraud is a low-risk, high-reward activity.

What is more critical is that we still equate peer review with correctness. The argument usually goes as follows: if it is important work, work that people rely upon, and it has been peer reviewed, then it must be correct. In sum, we think that conventional peer review + citations means validation. I think we are wrong:

  • Conventional peer review is shallow. Chandra, Hwang and Schön published faked results for many years in the most prestigious venues. The truth is that reviewers do not reproduce results. They usually do not have access to the raw data and software. And even if they did, they are unlikely to be motivated to redo all of the work to verify it.
  • Citations are not validations. Chandra, Hwang and Schön were generously cited. It is hardly surprising: impressive results are more likely to be cited. And doctored results are usually more impressive. Yet, scientists do not reproduce earlier work. Even if you do try to reproduce someone’s result, and fail, you probably won’t publish it. Indeed, publishing negative results is hard: journals are not interested. Moreover, there is a risk that it may backfire: the authors could go on the offensive. They could question your own competence.
  • There are many small frauds. Even without making up data, you can cheat by misleading the reader, by omission. You can present the data in creative ways, e.g. turn meaningless averages into hard facts by omitting the variance (see the fallacy of absolute numbers). These small frauds increase the likelihood that your paper will be accepted and then generously cited.

How do we solve the problem? (1) By trusting unimpressive results more than impressive ones. (2) By being suspicious of popular trends. (3) By running our own experiments.

Further reading: Become independent of peer review, The purpose of peer review and Peer review is an honor-based system.

Source: Seth Roberts.

Manifesto for Half-Arsed Academic Research

  • Research results are more important than the number of publications or citations.
    This is fine. Yet, we don’t have time to read your papers. So, just keep publishing a lot of papers each year. And get your influential friends to cite you. That’s how we’ll know whether you are good.
  • Science and truth are more important than spin and marketing.
    Yes, but keep pretending you will solve world hunger. And align your research results with the current fashionable trends.
  • You cannot tell where the next science breakthrough is going to come from.
    Maybe. Still, we want a plan of your research activities for the next five years.

Further reading: The hard truth about research grants and The secret behind radical innovation.

Source: Manifesto for Half-Arsed Agile Software Development via John D. Cook.

Counterintuitive factors determining research productivity

  • Permanent researchers publish more when they are in smaller labs.
  • Having many Ph.D. students fails to improve productivity.
  • Funding has little effect on research productivity.

Reference: Carayol, N. and Matt, M., Individual and collective determinants of academic scientists’ productivity, Information Economics and Policy 18 (1), 2006.

Further reading (on this blog): To be smarter, ignore external rewards, Is collaboration correlated with productivity?, Big schools are no longer giving researchers an edge?