Hashing is a common concept in programming. It is the process of mapping any object to an integer. It is used in fast search algorithms (like Rabin-Karp) and in hash maps. Almost all non-trivial software depends on hashing. Hashing often works best when it appears random.

It used to be that hashing was deterministic. The creators of a language would specify once and for all how a given string would be hashed. For example, in Java, the hash value of the string “a” is 97 ("a".hashCode()). No matter how often you run a program, the hash value will always be 97.

This can be a security problem because an adversary can find many strings that hash to the value 97. And then hashing no longer appears random.

For this reason, language designers are migrating to random hashing. In random hashing, the hash value of the string “a” can change every time you run your program. Random hashing is now the default in Python (as of version 3.3), Ruby (as of version 1.9) and Perl (as of version 5.18).

In Ruby, random hashing means that if you run the following program many times, you will get different outputs…

puts "a".hash

The same is true of the following Python program…

print ({"a":1,"b":2})

Here is the output I got, running it twice:

$ python3 tmp.py
{'a': 1, 'b': 2}
$ python3 tmp.py
{'b': 2, 'a': 1}

To my knowledge, Java still relies on deterministic hashing by default. However, the specification requires you to expect random hashing: the return value of hashCode() can change from one program execution to another.

How many Ruby, Python and Perl programmers realize that they are relying on random hashing?

Fernando Pérez gave a talk at Pycon 2014 with a brilliant slide:

The ideals reality of science:

  • The pursuit of verifiable answers highly cited papers for your c.v.
  • The validation of our results by reproduction convincing referees who did not see your code or data
  • An altruistic, collective enterprise A race to outrun your colleagues in front of the giant bear of grant funding

Credit: Bill Tozier for the pointer.

We all rely daily on free and open source software, whether we know it or not. The entire Internet is held together by open source software. The cheap router that powers you Wifi network at home uses the Linux kernel. Your android phone is based on the Linux kernel. Google servers run Linux. In 2014, almost everyone is a Linux user.

For most people, the financial value of this software is an abstract concept. I think that most people assume that open source software must be cheap.

On the contrary, producing quality open source software is tremendously expensive. And the financial investment grows every year.

How much did it cost to write the millions of lines of the Linux kernel? García-García and de Magdaleno estimated the cost of the Linux kernel, as of 2010, to 1.2 billion euros. That is how much it would cost of any one company to redo the Linux kernel from scratch.

You might assume that programmers working on the Linux kernel are hopeless nerds who live in their parent’s basement. In fact, most of them are highly qualified engineers earning 6-figure salaries or better. So the financial estimate represents real money. It is not a virtual cost.

Of course, the Linux kernel is a tiny fraction of all the open source software we rely upon. Most open source developers will never contribute to the Linux kernel: it is reserved for a small elite. According to the Linux foundation, the cost of building a standard Linux distribution (in 2008) would have been over $10 billion.

So what is the value of all open source software beyond Linux?

It helps to realize that software is a huge business. In Europe, companies and governments spend over 200 billion euros a year building software. To put this in perspective, the movie industry in the US generates about 10 billion dollars in revenues. In the United States, 1 out of every 200 workers is a software engineer. A very sizeable fraction of all “engineering” today is in software.

Of course, not all of the software is open source. Still, Daffara estimates the financial value of open source software, for Europe alone, to over 100 billion euros a year.

So why don’t we have more open source drug designs, movie content, textbooks, and so on?

The common argument is that nobody will be willing to invest, in say, a new textbook, a new drug or a new show if anyone can copy and redistribute it for free—the investment is too large.

But I think that the real difference is cultural. In the software world, entire businesses grew surrounded by open source software. They learned to thrive with and through open source software. Companies that entirely reject open source are at a competitive disadvantage. The same happened in the fashion industry. Designers assume that other people will copy them. In fact, designers hope others will copy them.

Other industries, like the pharmaceutical or education industry, have internalized the patent and copyright systems. That is why college students have to pay over $100 for a typical textbook whereas they can get an operating system that costed billions to make for free.

I think that if we had had a world where it is fair game to copy and distribute a textbook for free, we would still have textbooks. I think they would still be excellent. I also think that textbook authors would get paid, just like the programmers do.

Would the overall result be better? I do not know but it is fascinating to imagine what such a parallel universe might look like.

Credit: Thanks to Christopher Smith for useful pointers.

The new C++ standard introduced hash functions and hash tables in the language (as “unordered maps”).

As every good programmer should know, hash tables only work well if collisions between keys are rare. That is, if you have two distinct keys k1 and k2, you want their hash values h(k1) and h(k2) to differ most of the time.

The C++ standard does not tell us how the keys are hashed but it gives us two rules:

  • The value returned by h(k) shall depend only on the argument k.
  • For two different values k1 and k2, the probability that h(k1) and h(k2) “compare equal” (sic) should be very small.

The first rule says that h(k) must be deterministic. This is in contrast with languages like Java where the hash value can depend on a random number if you want (as long as the value remains the same through throughout the execution of a given program).

It is a reasonable rule. It means that if you are iterating through the keys of an “unordered set”, you will always visit the keys in the same order… no matter how many times you run your program.

It also means, unfortunately, that if you find two values such that h(k1) and h(k2), then they will always be equal, for every program and every execution of said programs.

The second rule is less reasonable. We have that h(k1) and h(k2) are constant values that are always the same. There is no random model involved. Yet, somehow, we want that the probability that they will be the same be low.

I am guessing that they mean that if you pick k1 and k2 randomly, the probability that they will hash to the same value is low, but I am not sure. If it is what they mean, then it is a very weak requirement: a vendor could simply hash strings down to their first character. That is a terrible hash function!

I am under the impression that the next revision of the C++ standard will fix this issue by following in Java’s footstep and allow hash functions to vary from one run of a program to another. That is, C++ will embrace random hashing. This will help us build safer software.

I used to think that knowledge was strongly transferable. I believed that learning physics could make you a better mechanic. I believed that learning mathematics could make you a better physicist or computer scientist.

After learning a lot of physics and mathematics, I realized that I still found it difficult to write good software, learn about mechanics, design circuits, or understand economics.

This has fundamentally affected my worldview. For example, I no longer take for granted that studying theory can be useful in practice. For me, this is a radical change from my twenties when I believed that computer science wasn’t worth studying since it was just “applied mathematics”.

Since I no longer believe that knowledge is strongly transferable, I have become critical of schooling in general. I used to think that you got smarter with each new college class you took. So I took a lot of them. I took about 30% more classes necessary to graduate in college. In high school, I took extra mathematics classes outside of the regular schedule (by choice, I don’t even think my parents knew). Thus I know a lot about useless topics.

Some of these classes have turned out to be useful. But less because of the knowledge that they have given me, and more because they have built up my confidence.

For example, all my training in abstract algebra helps me a bit when I want to study random hashing. How much did it help? Well, at some point, I realized that I needed to brush up on Galois fields. I picked up an undergraduate text, read one chapter, and I was good to go. That is, I knew that I could learn quickly about Galois fields if needed.

However, taking dozen of classes is an expensive way to build up your confidence. A better way would be for you to learn a few difficult things on your own. For example, I hardly know anything about electronics and I never took any class in it, but I know that I could become good at it because I have mastered similar skills.

In any case, because I believe that knowledge is only weakly transferable, I favour learning practical skills that are immediately useful. If you want to become a great software engineer, learn to program better… don’t study latin.

Given a chance, many parents would cram their kids’ schedule with as many academic classes as possible. The hidden assumption is that kids get “smarter” as they take more classes. But this is almost surely wrong. Of course, there are clear benefits to taking swimming lessons (you learn to swim!) or karate lessons (you learn to fight!), but taking an extra mathematics class might not help as much as you think.

Next Page »

Powered by WordPress