For case-insensitive string comparisons, avoid char-by-char functions

Sometimes we need to compare strings in a case-insensitive manner. For example, you might want ‘abc’ and ‘ABC’ to be considered equal. It is a well-defined problem for ASCII strings. In C/C++, there are basically two common approaches. You can do whole-string comparisons:

bool isequal = (strncasecmp(string1, string2, N) == 0);

Or you can do character-by-character comparisons, mapping each and every character to a lower-case version and comparing that:

bool isequal{true};
for (size_t i = 0; i < N; i++) {
  if (tolower(string1[i]) != tolower(string2[i])) {
    isequal = false;
    break;
  }
}

Intuitively, the second version is worse because it requires more code. We might also expect it to be slower. How much slower? I wrote a quick benchmark to test it out:

             Linux/GNU GCC   macOS/LLVM
strncasecmp  0.15 ns/byte    1 ns/byte
tolower      4.5 ns/byte     4.0 ns/byte

I got the first column of results with GNU GCC under Linux, and the second column on a different machine running macOS.

So for sizeable strings, the character-by-character approach might be 4 to 40 times slower! Results will vary depending on your standard library and the time of day. However, in all my tests, strncasecmp was always substantially faster. (Note: Microsoft provides similar functions under different names, see _strnicmp for example.)

Could you go faster? I tried rolling my own and it runs at about 0.3 ns/byte. So it is faster than the competition under macOS, but slower under Linux. I suspect that the standard library under Linux must rely on a vectorized implementation which might explain how it beats me by a factor of two.

I bet that if we use vectorization, we can beat the standard libraries.
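For reference, here is the kind of hand-rolled, ASCII-only routine I have in mind (a sketch of my own; the function names are hypothetical, and this scalar version is not the vectorized variant discussed above):

```cpp
#include <cstddef>
#include <cstdint>

// Map 'A'..'Z' to 'a'..'z'; leave every other byte unchanged.
static inline uint8_t ascii_lower(uint8_t c) {
  return c + ((uint8_t)(c - 'A') < 26 ? 0x20 : 0);
}

// ASCII-only case-insensitive equality over n bytes.
bool ascii_iequal(const char *a, const char *b, size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (ascii_lower((uint8_t)a[i]) != ascii_lower((uint8_t)b[i])) {
      return false;
    }
  }
  return true;
}
```

Note that the range check avoids the classic pitfall of `c | 0x20`, which would also map characters like ‘[’ to ‘{’.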


My code is available.

Sampling efficiently from groups

Suppose that you have to sample a student at random in a school. However, you cannot go into a classroom and just pick a student. All you are allowed to do is pick a classroom, and then let the teacher pick a student at random. The student is then removed from the classroom. You may then have to pick a student at random, again, but with a classroom that is now missing a student. Importantly, I care about the order in which the students were picked. But I also have a limited memory: there are too many students, and I cannot remember them all. I can barely remember how many students I have per classroom.

If the classrooms contain a variable number of students, you cannot just pick classrooms at random. If you do so, students in classrooms with lots of students are going to be less likely to be picked.

So instead, you can do it as follows. If there are N students in total, pick a number x between 1 and N. If x is smaller than or equal to the number of students in the first classroom, pick the first classroom and stop. Otherwise, if x is smaller than or equal to the number of students in the first two classrooms, pick the second classroom, and so forth. When you remove a student, you must account for the fact that the total (N) is reduced and that the count for the affected classroom is updated.

    while (N > 0) {
        int y = r.nextInt(N); // random integer in [0, N)
        int runningsum = 0;
        int z = 0;
        for (; z < runninghisto.length; z++) {
            runningsum += runninghisto[z];
            if (y < runningsum) { break; }
        }
        // found z
        runninghisto[z] -= 1;
        N -= 1;
    }

If you have many classrooms, this algorithm is inefficient: each time you must pick a classroom, you may need to iterate over all of the classroom counts. Suppose you have K classrooms: you may need on the order of N K operations to solve the problem.

Importantly, the problem is dynamic: the number of students in each classroom is not fixed. You cannot simply precompute some kind of map once and for all.

Can you do better than the naive algorithm? I can craft a “logarithmic” algorithm: one where the number of operations you do grows with the logarithm of the number of classrooms. I believe that my algorithm is efficient in practice.

So divide your classrooms into two sets containing R and S students respectively, so that R + S = N. Pick an integer in [0,N). If it is smaller than R, decrement R and pick a classroom in the corresponding set. If it is larger than or equal to R, then go with the other set of classrooms. So we have halved the size of the problem. We can repeat the same trick recursively: divide the two subsets of classrooms, and so forth. Instead of merely maintaining the total number of students in all of the classrooms, you keep track of the counts in a tree-like manner.

We can operationalize this strategy efficiently in software:

  • Build an array containing the number students in each classroom.
  • Replace every two counts by their sum: instead of storing the number of students in the second classroom, store the number of students in the first two classrooms.
  • Replace every four counts by the sum of the four counts: where the count of the fourth class was, store the sum of the number of students in the first four classrooms.
  • And so forth.

In Java, it might look like this:

int level = 1;
for (; (1 << level) < runninghisto.length; level++) {
  for (int z = 0;
       z + (1 << level) < runninghisto.length;
       z += 2 * (1 << level)) {
    runninghisto[z + (1 << level)] += runninghisto[z];
  }
}

You effectively have a tree of prefixes which you can then use to find the right classroom without visiting all of the classroom counts. You visit a tree with a height that is the logarithm of the number of classrooms. At each level, you may have to decrement a counter.

        while (N > 0) {
            int y = r.nextInt(N);
            // select in logarithmic time
            level = maxlevel;
            int offset = 0;
            for (; level >= 0; level -= 1) {
                if (y >= runninghisto[offset + (1 << level)]) {
                    runninghisto[offset + (1 << level)] -= 1;
                    offset += (1 << level);
                }
            }
            // found offset
            N -= 1;
        }

I have written a Java demonstration of this idea. For simplicity, I assume that the number of classrooms is a power of two, but this limitation could be lifted by doing a bit more work. Instead of N K operations, you can now do only N log K operations.

Importantly, the second approach uses no more memory than the first one. All you need is enough memory to maintain the histogram.

My Java code includes a benchmark.

Update. If you are willing to use batch processing and a little bit of extra memory, you might be able to do even better in some cases. Go back to the original algorithm. Instead of picking one integer in [0,N), pick C of them for some number C. Then find out in which room each of these C integers fits, as before, decrementing the counts. If you pick the C integers in sorted order, a simple sequential scan should be enough. For C small compared to N, you can pick the C integers in linear time using efficient reservoir sampling. Then you can shuffle the result using Knuth’s random shuffle, in linear time. So you get to do about N K / C operations. (Credit: Travis Downs and Richard Startin.)
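Here is a C++ sketch of this batched variant (my own illustration with hypothetical names; it draws distinct sorted indices with a simple linear sequential scheme rather than a reservoir):

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Draw C distinct students (without replacement) from classrooms whose
// sizes are given by histo, in one batch. Returns the classroom of each
// pick in uniformly random order, and updates histo for the next batch.
std::vector<size_t> batch_pick(std::vector<long> &histo, long N, long C,
                               std::mt19937_64 &gen) {
  // Pick C distinct indices in [0, N), already sorted, in linear time:
  // keep index j with probability (C - chosen) / (N - j).
  std::vector<long> picks;
  picks.reserve((size_t)C);
  for (long j = 0; j < N && (long)picks.size() < C; j++) {
    std::uniform_int_distribution<long> d(0, N - j - 1);
    if (d(gen) < C - (long)picks.size()) picks.push_back(j);
  }
  // A single sequential scan maps every sorted pick to its classroom.
  std::vector<size_t> rooms;
  rooms.reserve((size_t)C);
  long runningsum = 0;
  size_t z = 0;
  for (long p : picks) {
    while (p >= runningsum + histo[z]) { runningsum += histo[z]; z++; }
    rooms.push_back(z);
  }
  for (size_t r : rooms) histo[r] -= 1; // update counts for the next batch
  // Knuth's shuffle restores a uniformly random pick order.
  std::shuffle(rooms.begin(), rooms.end(), gen);
  return rooms;
}
```

The scan maps picks against the original counts and only decrements afterwards, so several picks landing in the same classroom are handled correctly within a batch.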

Further reading: Hierarchical bin buffering: Online local moments for dynamic external memory arrays, ACM Transactions on Algorithms.

Follow-up: Sampling Whisky by Richard Startin

Science and Technology links (April 25th 2020)

  1. People’s muscles tend to become weaker with age, a process called sarcopenia. It appears that eating more fruits and vegetables is associated with a far lower risk of sarcopenia. Unfortunately, it is merely an association: it does not mean that if you eat more fruits and vegetables, you are at lower risk for sarcopenia. We just do not know.
  2. In human beings, males are more diverse than females with respect to many attributes. Just look at height: it is common that, in a classroom, both the shortest and the tallest individual are boys. In chimpanzees, researchers report that males have greater brain structure diversity.
  3. Scientists have created “healing patches” for hearts. They greatly accelerate heart repair following a heart attack, in mice and pigs. They are made out of pig heart tissue, with the pig cells removed. The patches are believed to be safe, and they can be frozen for later use.
  4. Many people lose some of their hair over time. About half of all men go bald over time. We do not quite understand the process. New research suggests that hair loss may result from the loss of stem cells in the skin. Stem cells are “progenitor cells” able to produce new specialized cells. The research suggests that stem cell therapies could be beneficial to fight hair loss.
  5. It is commonly believed that evolution made us as fit as possible. In this model, aging is the inevitable result of entropy. I do not believe this entropic model. In simulations, researchers have shown that a population of worms might be fitter if its individuals age and die. This should not surprise us: annual plants have evolved out of perennial plants. Your garden flowers can often easily be kept alive indefinitely even though they are annuals. Your plants age and die on schedule because it is advantageous for the species. Animals of a similar size frequently have vastly different lifespans… naked mole rats are not subject to Gompertz curves (they are effectively ageless).

Rounding integers to even, efficiently

When dividing a numerator n by a divisor d, most programming languages round “down”. It means that 1/2 is 0. Mathematicians will insist that 1/2 is one half and claim that you are really computing floor(1/2). But let me think like a programmer. So 3/2 is 1.

If you always want to round up, you can instead compute (n + d – 1)/d. Applying it to n = 3 and d = 2, I get (3 + 2 – 1)/2 = 2.

We sometimes want to round the value to the nearest integer. So 1.4 becomes 1, 1.6 becomes 2. But what if you have 1.5 (the result of 3/2)? You can either round up or round down.

  • You can round “up” when hitting the midpoint with (n + d/2)/d.
  • You can round “down” when hitting the midpoint with (n + (d – 1)/2)/d.
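As a sketch for unsigned integers (the helper names are mine), these conventions look like this:

```cpp
#include <cstdint>

// Round up unconditionally: 1.5 -> 2, 1.1 -> 2.
uint64_t div_round_up(uint64_t n, uint64_t d) { return (n + d - 1) / d; }

// Round to nearest, upward at the midpoint: 1.5 -> 2, 1.4 -> 1.
uint64_t div_nearest_up(uint64_t n, uint64_t d) { return (n + d / 2) / d; }

// Round to nearest, downward at the midpoint: 1.5 -> 1, 1.6 -> 2.
uint64_t div_nearest_down(uint64_t n, uint64_t d) { return (n + (d - 1) / 2) / d; }
```

For an odd divisor d there is no exact midpoint, and the last two helpers agree.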

But there is a third common way to round. You want to round to even. It means that when you hit the midpoint, you round to the nearest even integer. So 1.5 becomes 2, 2.5 becomes 2, 3.5 becomes 4 and so forth. I am sure that it has been worked out before, but I could not find an example, so I rolled my own:

off = (n + d / 2);
roundup = off / d;
ismultiple = ((off % d) == 0);
iseven = (d & 1) == 0;
return (ismultiple && iseven) ? roundup - (roundup & 1) : roundup;

Though there is a comparison and what appears like a branch, I expect most compilers to produce essentially branchless code. The result should be about five instructions on most hardware. I expect that it will be faster than converting the result to a floating-point number, rounding it up and converting the resulting floating-point number back.

I have posted working source code.

Why does it work?

Firstly, you never have to worry about hitting the midpoint if the divisor is odd. The remainder of a division by an odd number cannot be equal to d/2. Thus, starting from the round-up approach (n + d/2)/d, we only need to correct the result. When n is equal to d + d/2, we need to round up; when it is equal to 2d + d/2, we need to round down, and so forth. So we only need to remove 1 from (n + d/2)/d when n is equal to 2kd + d/2 for some integer k. But when n is equal to 2kd + d/2, I have that n + d/2 is equal to (2k + 1)d. That is, the quotient is odd.

Even better: You can make it explicitly branchless (credit: Falk Hüffner):

off = (n + d / 2);
roundup = off / d;
ismultiple = ((off % d) == 0);
return (d | (roundup ^ ismultiple)) & roundup;

Nitpicking: You may be concerned with overflows. Indeed, if both n and d are large, it is possible for n + d/2 to exceed the allowable range. However, the above functions are used in the Linux kernel. And you can make them more robust at the expense of a little bit more work.

Credit: This blog post was motivated by an email exchange with Colin Bartlett.

Science and Technology links (April 11th 2020)

  1. Greenland sharks reach their sexual maturity when they are 150 years old and they live hundreds of years. Some living sharks today were born in the 16th century.
  2. Only 5% of the general population rates themselves as below average in intelligence. And they overestimate the intelligence of their mate even more.
  3. Fresh soil has a distinctive smell which may be caused by bacteria that are trying to attract specific animals.
  4. Earth’s mantle is kept warm due to radioactivity. So we are basically living on a warm radioactive rock.
  5. Employees who work from home can be noticeably more productive. My own estimate is that I am about 20% more productive when working from home. However, social interactions are easier in person.
  6. We had the coldest Arctic winter since 1979. It caused the first genuine ozone hole in the Arctic. The ozone layer protects the Earth from radiation.
  7. According to a recent research paper in Nature, fermented foods (cheese, butter, alcoholic beverages, yogurt) as well as fibers may have anti-obesity properties due to the production of short-chain fatty acids (acetate, propionate, and butyrate). Propionate is also used as a food preservative, in addition to being produced by our own gut bacteria when we digest fibers, or by bacteria in the process of fermentation. There is much evidence that propionate may be helpful against obesity: Lin et al. (2012), Chambers et al. (2019), Arora et al. (2011). In contrast, a widely reported previous work by Tirosh et al. suggests that propionate leads to weight gain in mice and might promote obesity and metabolic abnormalities. The press reported the idea that propionate, presented as a food preservative, could make us fat.

    My take? The science is probably “complicated” and we do not have all the answers. This being said, while it is true that there may be propionate in your bread and pizza, to prevent it from moulding, and while it is also true that pizza and bread may make you fat, I’d worry a lot more about the calories in bread and pizza rather than the effect of propionate itself.

  8. The current pandemic might be the start of a revolution in peer review.

Multiplying backward for profit

Most programming languages have integer types with arithmetic operations like multiplications, additions and so forth. Our main processors support 64-bit integers, which means that you can deal with rather large integers. However, you cannot represent everything with a 64-bit integer. What if you need to multiply 5^100 by 37? Programming languages like Python or JavaScript will gracefully handle it. In programming languages like C/C++, you will need to use a library. In Java, you will have to use special classes that are part of the standard library.

You will almost never need to deal with integers that do not fit in 64 bits, unless you are doing cryptography or some esoteric numerical analysis. Nevertheless, it does happen that you need to work with very large integers.

How does the product of 5^100 by 37 get computed if processors are limited to 64-bit words? Underneath, your software represents such large integers using multiple 64-bit words. The standard approach computes the product starting from the least significant bits. Thus if you are multiplying an integer that requires a single machine word (w) with an integer that requires n machine words, you will use n multiplications, starting with a multiplication between the word w and the least significant word of the other integer, going up to the most significant words. The code in C/C++ might look like so:

// multiply w * b where b contains n words, n > 0
p = full_multiplication(w, b[0]); // 64-bit x 64-bit => 128-bit
output[0] = p.low;   // least significant 64 bits
uint64_t r = p.high; // most significant 64 bits
for (i = 1; i < n; i++) {
  p = full_multiplication(w, b[i]);
  p.low += r; // add r with carry
  if (p.low < r) p.high++;
  output[i] = p.low;
  r = p.high;
}
output[n] = r;

You can also multiply backward, starting from the most significant words. Though you might think it would be less efficient, you can still do the same product using the same n multiplications. The code is going to be a bit more complicated because you have to carry the overflow that you may encounter in the less significant words upward. Nevertheless, you can implement it in C/C++ using only a few extra lines of code. According to my hasty tests, it is only marginally slower (by about 20%).

// multiply w * b where b contains n words, n > 0
p = full_multiplication(w, b[n - 1]); // 64-bit x 64-bit => 128-bit
output[n - 1] = p.low; // least significant 64 bits of this product
output[n] = p.high;    // most significant 64 bits
for (i = n - 2; i >= 0; i--) {
  p = full_multiplication(w, b[i]);
  output[i] = p.low;
  // check for overflow
  bool overflow = (output[i + 1] + p.high < output[i + 1]);
  output[i + 1] += p.high;
  for (size_t j = i + 2; overflow; j++) {
    output[j]++; // propagate the carry
    overflow = (output[j] == 0);
  }
}

Why would computing the multiplication backward ever be useful?

Suppose that you are not interested in the full product. Maybe you just need to check whether the result is within some bounds, or maybe you just need a good approximation of the product. Then starting from the most significant bits could be helpful if you can stop the computation after you have enough significant bits.

It turns out that you can do so, efficiently. You can compute the most significant k words using no more than an expected k + 0.5 multiplications. Furthermore, if you are careful, you can later resume the computation and complete it.

At each step, after doing k multiplications and computing k + 1 words, these k + 1 most significant words may underestimate the true k + 1 most significant words because we have not added the carry from the less significant multiplications. However, we can bound the value of the carry: it is less than w. To see that it must be so, let r be the number of remaining words in the multiword integer that we have not yet multiplied by w. The maximal value of these words is 2^(64r) – 1. So we are going to add, at most, (2^(64r) – 1)w to our partially computed integer: the overflow (carry) above the r words is at most ((2^(64r) – 1)w)/2^(64r), a value strictly less than w.

This means that if we add w – 1 to the least significant of the k + 1 words and it does not overflow, then we know that the k most significant words are exact, and we can stop. This might happen roughly half the time, assuming that we are dealing with random inputs. When it does overflow, you can just continue and compute one more product. If, again, adding w – 1 to the least significant word you computed does not create an overflow, you can stop, confident that the k + 1 most significant words are exact.

However, you then gain another, more powerful stopping condition: if the second least significant word is not exactly 2^64 – 1, then you can also stop, confident that the k most significant words are exact, because adding w to the least significant word can, at most, translate into a carry of +1 to the second least significant word. Because it is quite unlikely that you will end up with exactly the value 2^64 – 1, we know that, probabilistically, you will not need more than k + 1 multiplications to compute exactly the k most significant words. And quite often, you will be able to stop after k multiplications.

My code implementing these ideas is a bit too long for a blog post, but I have published it as its own GitHub repository, so you can check it out for yourself. It comes complete with tests and benchmarks.

I have restricted my presentation to the case where at least one integer fits in a single word. Generalizing the problem to the case where you have two multiword integers is a fun problem. If you have ideas about it, I’d love to chat.

Implementation notes: The code is mostly standard C++. The main difficulty is to compute the full multiplication, which takes two 64-bit words and generates two 64-bit words to represent the product. To my knowledge, there is no standard way to do it in C/C++, but most compilers offer you some way to do it. At the CPU level, computing the full product is supported with efficient instructions on mainstream architectures (64-bit ARM and x64).
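For instance, with GCC and Clang you can get the full product through the __uint128_t extension (a sketch; the struct and function names merely mirror the snippets above, and MSVC users would reach for _umul128 instead):

```cpp
#include <cstdint>

struct uint128_product {
  uint64_t low;  // least significant 64 bits
  uint64_t high; // most significant 64 bits
};

// Full 64-bit x 64-bit => 128-bit product, using the GCC/Clang
// __uint128_t extended integer type.
uint128_product full_multiplication(uint64_t a, uint64_t b) {
  __uint128_t product = (__uint128_t)a * b;
  return { (uint64_t)product, (uint64_t)(product >> 64) };
}
```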

Credit: This work is inspired by notes and code from Michael Eisel.

We released simdjson 0.3: the fastest JSON parser in the world is even better!

Last year (2019), we released the simdjson library. It is a C++ library available under a liberal license (Apache) that can parse JSON documents very fast. How fast? We reach and exceed 3 gigabytes per second in many instances. It can also parse millions of small JSON documents per second.

The new version is much faster. How much faster? Last year, we could parse a typical benchmark file at a speed of 2.0 GB/s, and then we reached 2.2 GB/s. We are now reaching 2.5 GB/s. Why go so fast? In comparison, a fast disk can reach 5 GB/s and the best network adapters are even faster.

The following plot presents the 2020 simdjson library (version 0.3) compared with the fastest standard compliant C++ JSON parsers (RapidJSON and sajson). It ran on a single Intel Skylake core, and the code was compiled with the GNU GCC 9 compiler. All tests are reproducible using Docker containers.

In this plot, RapidJSON and simdjson have exact number parsing, while RapidJSON (fast float) and sajson use approximate number parsing. Furthermore, sajson has only partial unicode validation whereas the other parsers offer exact encoding (UTF-8) validation.

If we had only improved the performance, it would already be amazing. But our new release packs a whole lot of improvements:

  1. Multi-Document Parsing: Read a bundle of JSON documents (ndjson) 2-4x faster than doing it individually.
  2. Simplified API: The API has been completely revamped for ease of use, including a new JSON navigation API and fluent support for error-code and exception styles of error handling with a single API. In the past, using simdjson was a bit of a chore; the new approach is definitely modern. See for yourself:
    auto cars_json = R"( [
      { "make": "Toyota", "model": "Camry",  "year": 2018, 
           "tire_pressure": [ 40.1, 39.9 ] },
      { "make": "Kia",    "model": "Soul",   "year": 2012, 
           "tire_pressure": [ 30.1, 31.0 ] },
      { "make": "Toyota", "model": "Tercel", "year": 1999, 
           "tire_pressure": [ 29.8, 30.0 ] }
    ] )"_padded;
    dom::parser parser;
    dom::array cars = parser.parse(cars_json).get<dom::array>();
    // Iterating through an array of objects
    for (dom::object car : cars) {
      // Accessing a field by name
      cout << "Make/Model: " << car["make"] 
               << "/" << car["model"] << endl;
      // Casting a JSON element to an integer
      uint64_t year = car["year"];
      cout << "- This car is " << 2020 - year 
               << " years old." << endl;
      // Iterating through an array of floats
      double total_tire_pressure = 0;
      for (double tire_pressure : car["tire_pressure"]) {
        total_tire_pressure += tire_pressure;
      }
      cout << "- Average tire pressure: " 
          << (total_tire_pressure / 2) << endl;
      // Writing out all the information about the car
      for (auto [key, value] : car) {
        cout << "- " << key << ": " << value << endl;
      }
    }

  3. Exact Float Parsing: simdjson parses floats flawlessly at high speed.
  4. Fallback implementation: simdjson now has a non-SIMD fallback implementation, and can run even on very old 64-bit machines. This means that you no longer need to check whether the system supports simdjson.
  5. Automatic allocation: as part of API simplification, the parser no longer has to be preallocated: it will adjust automatically when it encounters larger files.
  6. Runtime selection API: We have exposed simdjson’s runtime CPU detection and implementation selection as an API, so you can tell what implementation we detected and test with other implementations.
  7. Error handling your way: Whether you use exceptions or check error codes, simdjson lets you handle errors in your style. APIs that can fail return simdjson_result, letting you check the error code before using the result. But if you are more comfortable with exceptions, skip the error code
    and cast straight to the value you need, and exceptions will be thrown automatically if an error happens. Use the same API either way!
  8. Error chaining: We also worked to keep non-exception error-handling short and sweet. Instead of having to check the error code after every single operation, now you can chain JSON navigation calls like looking up an object field or array element, or casting to a string, so that you only have to check the error code once at the very end.
  9. We now have a dedicated web site in addition to the GitHub site.

Credit: many people contributed to simdjson, but John Keiser played a substantial role worthy of mention.

Science and Technology links (March 28th 2020)

    1. In a laboratory, we know how to turn any of our cells into youthful stem cells using something called the Yamanaka factors. If you expose cells to such factors for a short time, they appear to remain functional specialized cells but become more youthful. Researchers demonstrated this using cartilage cells from elderly patients suffering from osteoarthritis. They also rejuvenated muscle cells. It now remains to do the same in live human beings.
    2. It has been widely reported that artificial intelligence, and specifically deep learning, can match or surpass clinicians in medical imaging tasks. Unfortunately, it appears that this is far from demonstrated with the necessary rigor:

      Few prospective deep learning studies and randomised trials exist in medical imaging. Most non-randomised trials are not prospective, are at high risk of bias, and deviate from existing reporting standards. Data and code availability are lacking in most studies, and human comparator groups are often small.

    3. Apple released its latest tablet (the iPad Pro) with an integrated LiDAR that can accurately map your environment at up to 5 meters of distance. A LiDAR is basically a laser radar. It was famously used by NASA to map the lunar surface in the 1970s, but twenty years ago the technology was still out of reach for most of us: reserved for military and highly specialized applications.
    4. Japan and Korea have more than 13 hospital beds per 1000 people; Spain, Italy, and the U.S. have about 3 beds per 1000 people.
    5. Due to a worldwide pandemic, we are running the largest distance-learning experiment in history. Countless teachers worldwide are teaching online for the first time.
    6. Modern disks (such as a USB drive) might be lighter when they are filled with data than when they are empty.
    7. Our smartphones will soon move from 4G networks to 5G networks. The latter are much faster. However, they cause the phones to consume 20% more energy according to some report.
    8. A few decades ago, most of us had computers with a single processor. Over time we acquired processors with two processor cores, each core acting as a processor. Then we got four cores with some cores able to act as if they are made of two or four “logical” processors. The next gaming consoles (e.g., the PS5) will have main CPUs with eight processor cores. It is not difficult to find servers that have well over a hundred logical processors. Yet it appears that Windows was limited to at most 64 logical processors, possibly because the architects of Windows never imagined that we would soon reach this limit.

Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores

Memory is organized in cache lines, frequently blocks of 64 bytes. On Intel and AMD processors, you can store and load memory in blocks of various sizes, such as 64 bits, 128 bits or 256 bits.

In the old days, and on some limited devices today, reading and storing to memory required you to respect alignment. You could not simply write any block memory anywhere. Today you mostly can write wherever you like. There is a small penalty for misalignment but the penalty is typically under 10% as I argued in my 2012 post Data alignment for speed: myth or reality?

Yet writing or reading from two cache lines (what Intel calls a cache line split) instead of a single one is likely to be more expensive at least some of the time. Let us explore an interesting scenario. It is sometimes possible to avoid crossing a cache line boundary by doing two memory accesses instead of a single large one. Is it worth it?

Cache lines in memory are aligned on addresses that are divisible by 64 bytes. Suppose that you want to store 256 bits of data every 64 bytes, at just the right offset so that the 256 bits overlap two cache lines. You hit the last 16 bytes of one cache line and the first 16 bytes of the next one. You can achieve the desired result by starting with an offset of 48 bytes. That is, you find a memory address that is divisible by 64 bytes, and then you add 48 bytes.
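One way to construct such a pointer (a sketch; the function name and the 64 KB buffer size are arbitrary assumptions of mine):

```cpp
#include <cstdint>
#include <cstdlib>

// Return a pointer 48 bytes past a 64-byte-aligned allocation, so that
// each 256-bit store at 64-byte strides straddles two cache lines.
char *make_straddling_pointer() {
  // aligned_alloc requires the size to be a multiple of the alignment.
  char *buffer = (char *)aligned_alloc(64, 1 << 16);
  return buffer + 48;
}
```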

In code, using Intel intrinsics, it looks as follows:

char *p = ...; // 48 bytes past a 64-byte-aligned address
for (size_t i = 0; i < ... ; i++) {
  _mm256_storeu_si256((__m256i *)(p + i * 64), vec);
}

You can avoid crossing the cache line boundary entirely by first storing 128 bits of data at the 48-byte offset, and then storing another 128 bits of data. The first store is at the end of the first cache line and the second store is at the beginning of the second one.

char *p = ...; // 48 bytes past a 64-byte-aligned address
for (size_t i = 0; i < ... ; i++) {
  _mm_storeu_si128((__m128i *)(p + i * 64), vec);
  _mm_storeu_si128((__m128i *)(p + i * 64 + 16), vec);
}

How do these two approaches fare? I wrote a simple benchmark that stores many blocks of 256 bits at a 48-byte offset. It either stores them in one 256-bit step or in two 128-bit steps. I record the number of cycles per iteration on an AMD Rome processor. I rely on the pre-installed RedHat LLVM compiler (clang version 3.4.2).

A single 256-bit write 2.33 cycles
Two 128-bit writes 2.08 cycles

It is a gain of slightly over 10% for the two 128-bit writes. What if you remove the 48-byte offset (or set it to zero)? Then both benchmark clock at 2.08 cycles per iteration. I expect that the 48-byte offset is a best-case scenario for the two 128-bit writes: if you change it then both approaches have the same cache-line overlap problems. So this 10% gain requires you to choose the alignment carefully.

My source code is available. Of course, your results will depend on the processor and, to some extent, on the compiler. You should be able to run my benchmark program on your own Linux x64 system.

Be mindful that if you are getting worse results on a per cycle basis on comparable hardware, you might be limited by your compiler. An analysis of the assembly might be required.

Further reading: Travis Downs has an interesting complementary analysis. He finds that unaligned stores crossing a 32-byte boundary can be tremendously expensive (i.e., 5 cycles) on the type of processor I am using for these tests (Zen 2). The 32-byte boundary exists irrespective of cache lines. Meanwhile, he finds that stores aligned exactly on a 32-byte boundary are much cheaper (1 cycle).

Number of atoms in the universe versus floating-point values

It is estimated that there are about 10^80 atoms in the universe. The estimate for the total number of electrons is similar.

It is a huge number and it far exceeds the maximal value of a single-precision floating-point type in our current computers (which is about 10^38).

Yet the maximal value that we can represent using the common double-precision floating-point type is larger than 10^308. It is an unimaginably large number. There will never be any piece of engineering involving as many as 10^308 parts.

Using a double-precision floating-point value, we can easily represent the number of atoms in the universe. We could also represent the number of ways you can pick any three individual atoms at random in the universe.

If your software ever produces a number so large that it will not fit in a double-precision floating-point value, chances are good that you have a bug.

Further reading: Lloyd N. Trefethen, Numerical Analysis, Princeton Companion to Mathematics, 2008


Science and Technology links (March 14th 2020)

    1. Mothers, but not fathers, possess gender-related implicit biases about emotion expression in children.
    2. Chinese researchers used to be offered cash rewards for publishing research articles. The Chinese government has banned such rewards. Increasingly, we are collectively realizing that a single-minded focus on publication numbers ensures systemic waste and failure. This is not unique to the business of research. You do not want to assess programmers by counting the number of lines of code they write, or writers by how many books they have published. The naive application of simple metrics can lead to disastrous results.
    3. A patient’s lung was removed, cleaned from cancer and put back.
    4. Many major American universities have moved classes online for the rest of the term due to the ongoing pandemic.
    5. Medical doctors keep on practicing many inefficient or harmful therapies, against all scientific evidence. For example, stents are frequently put into people at risk of cardiovascular disease, even though they are virtually useless.

Fast float parsing in practice

In our work parsing JSON documents as quickly as possible, we found that one of the most challenging problems is to parse numbers. That is, you want to take the string “1.3553e142” and convert it quickly to a double-precision floating-point number. You can use the strtod function from the standard C/C++ library, but it is quite slow. People who write fast parsers tend to roll their own number parsers (e.g., RapidJSON, sajson), and so we did. However, we sacrifice some standard compliance. You see, the floating-point standard that we all rely on (IEEE 754) has some hard-to-implement features like “round to even”. Sacrificing such fine points means that you can be off by one bit when decoding a string. In practice, this hardly ever matters: double-precision numbers have more accuracy than any engineering project will ever need and a difference on the last bit is irrelevant. Nevertheless, it is mildly annoying.

A better alternative in C++ might be from_chars. Unfortunately, many standard libraries have not yet caught up with the standard and they fail to support from_chars properly. One can get around this problem by using the excellent abseil library. It tends to be much faster than the venerable strtod function.

Unfortunately, for our use cases, even abseil’s from_chars is much too slow. It can be two or three times slower than our fast-but-imperfect number parser.

I was going to leave it be. Yet Michael Eisel kept insisting that it should be possible to both follow the standard and achieve great speed. Michael gave me an outline. I was unconvinced. And then he gave me a code sample: it changed my mind. The full idea requires a whole blog post to explain, but the gist of it is that we can attempt to compute the answer, optimistically using a fast algorithm, and fall back on something else (like the standard library) as needed. It turns out that for the kind of numbers we find in JSON documents, we can parse 99% of them using a simple approach. All we have to do is correctly detect the error cases and bail out.

Your results will vary, but the next table gives the speed numbers from my home iMac (2017). The source code is available along with everything necessary to test it out (linux and macOS).

parser MB/s
fast_double_parser (new) 660 MB/s
abseil, from_chars 330 MB/s
double_conversion 250 MB/s
strtod 70 MB/s

Science and Technology links (March 7th 2020)

  1. The benefits of flu vaccines in healthy adults are modest. They do not reduce neonatal deaths, hospitalisations, or working days lost. They do not seem more helpful in older adults.
  2. While in 1970 only 1% of the American population was severely obese, the rate of severe obesity is now 9%.
  3. Older adults who volunteer and stay active maintain a larger brain.
  4. The election of Donald Trump in 2016 led to slightly fewer baby boys in Canada. Stress tends to favor female births.
  5. The plague is still killing people in Madagascar. The disease is carried by rats.
  6. The social cost of carbon has been estimated to be as high as $60 a tonne, meaning that if we spent $60 to remove the production of a tonne of carbon, we would be even. Some of the latest recent estimates are much lower: between $0.60 and $3. These new estimates take into account the benefits of CO2 such as higher plant productivity.

Will calling “free” or “delete” in C/C++ release the memory to the system?

In the C programming language, we typically manage memory manually. A typical heap allocation is a call to malloc followed by a call to free. In C++, you have more options, but it is the same routines under the hood.

// allocate N kB
data = malloc(N * 1024);
// do something with the memory
// ...
// release the memory
free(data);

It stands to reason that if your program just started and the value of N is large, then the call to malloc will result in an increased memory usage by about N kilobytes. And indeed, it is the case.

So what is the memory usage of your process after the call to “free”? Did the N bytes return to the system?

The answer is that, in general, it is not the case. I wrote a small program under Linux that allocates N kilobytes and then frees them. It will then measure the RAM usage after the call to free. The exact results will depend on your system, standard library and so on, but I give my results as an illustration.

As you can observe in the table, the memory does sometimes get released, but only when it is a large block of over 30 MB in my tests. It is likely because in such cases a different code path is used (e.g., calling mmap, munmap). Otherwise, the process holds on to its memory, never again releasing it to the system.

memory requested memory usage after a free
1 kB 630 kB
100 kB 630 kB
1000 kB 2000 kB
10,000 kB 11,000 kB
20,000 kB 21,000 kB
30,000 kB 31,000 kB
40,000 kB 1,200 kB
50,000 kB 1,300 kB
100,000 kB 1,300 kB

Of course, there are ways to force the memory to be released to the system (e.g., malloc_trim may help), but you should not expect that it will do so by default.

Though I use C/C++ as a reference, the exact same effect is likely to occur in a wide range of programming languages.

What are the implications?

  • You cannot easily measure the memory usage of your data structures from the amount of memory that the process uses.
  • It is easy for a process that does not presently hold any data to appear to be using a lot of memory.

Further reading: glibc malloc inefficiency

Science and Technology links (February 29th 2020)

  1. No one really understands how planes fly. This puts a dent in the model whereby inventions follow theory.
  2. Physician salaries and diagnostic tests account for 4% of GDP in the US.
  3. A supplement (NMN) available to human beings has been used in mice: it restores fertility in old female mice.
  4. The investment firm Andreessen Horowitz is critical of artificial intelligence start-ups. It finds that they have low margins due to the cost of training models using many computers on the cloud. They may fail to scale due to the difficulty of generalizing use cases. And they are subject to commodification. That is, “artificial intelligence” as a business proposition may be far less profitable than typical “software” ventures.
  5. The most profitable businesses tend to be started by older people who have lots of experience in a specific industry.
  6. Dogs can detect heat sources far away (1.6 meters) with their noses.
  7. Cancer patients received genetically edited immune cells (using CRISPR). It turned out well.

Fast divisionless computation of binomial coefficients

Suppose that I give you a set of n objects and I ask you to pick k distinct objects, thus forming a new subset. How many such subsets are there? If you have taken college mathematics, you have probably learned that the answer is given by binomial coefficients. They are defined as n!/(k! * (n-k)!) where the exclamation point refers to the factorial. For a programmer, it is easier to think of binomial coefficients as the result of this loop:

def binom(n, k):
    top = 1
    bottom = 1
    for i in range(1, k + 1):
        top *= n - i + 1
        bottom *= i
    return top // bottom

Though simple enough, this algorithmic definition is not practical if you are interested in performance. Both the numerator and the denominator grow large quickly. They will soon require several machine words. A programming language like Python will happily give you the correct answer, but slowly. In Java or C, you are likely to get the wrong result due to a silent overflow.

Of course, if you know that the binomial coefficient is too large to fit in a machine word (64 bits), then you may as well go to a big-integer library. But what if you know that the result fits in a machine word? Maybe you have a reasonable bound on the size of n.

Then instead of waiting at the very end before doing the division, you can divide at each iteration in the loop:

def binom(n, k):
    answer = 1
    for i in range(1, k + 1):
        answer = answer * (n - k + i) // i
    return answer

This new approach may still overflow even if the binomial coefficient itself fits in a machine word because we multiply before dividing. You can get around this issue by first finding a common divisor to both the multiplier and the divisor, and factoring it out. Or else, you can further restrict the values of n and k. Let us choose this last path.

We still have a problem: we need k-1 multiplications and divisions. The multiplications are relatively cheap, but the divisions have longer latencies. We would prefer to avoid divisions entirely. If we assume that k is small, then we can use the fact that a division by a known value can always be replaced by a shift and a multiplication. All that is needed is to precompute the shift and the multiplier. If there are few possible values of k, we can precompute them with little effort.

Hence, if you know that, say, n is smaller than 100 and k smaller than 10, the following function will work…

uint64_t fastbinomial(int n, int k) {
  uint64_t np = n - k;
  uint64_t answer = np + 1;
  for (uint64_t z = 2; z <= (uint64_t)k; z++) {
    answer = answer * (np + z);
    fastdiv_t f = precomputed[z];
    answer >>= f.shift;
    answer *= f.inverse;
  }
  return answer;
}
I provide a full portable implementation complete with some tests. Though I use C, it should work as-is in many other programming languages. It should only take tens of CPU cycles to run. It is going to be much faster than implementations relying on divisions.

Another trick that you can put to good use is that the binomial coefficient is symmetric: you can replace k by n-k and get the same value. Thus if you can handle small values of k, you can also handle values of k that are close to n. That is, the above function will also work for n smaller than 100 and k larger than 90, if you just replace k by n-k.

Is that the fastest approach? Not at all. Because n is smaller than 100 and k smaller than 10, we can precompute (memoize) all possible values. You only need an array of 1000 values. It should fit in 8kB without any attempt at compression. And I am sure you can make it fit in 4kB with a little bit of compression effort. Still, there are instances where relying on a precomputed table of several kilobytes and keeping them in cache is inconvenient. In such cases, the divisionless function would be a good choice.

Alternatively, if you are happy with approximations, you will find floating-point implementations.

Science and Technology links (February 22nd 2020)

    1. In a large cohort study, the highest probability of reaching 90 years old was found for those drinking between 5 g and 15 g of alcohol per day. This does not mean that if you are not drinking, you should start.
    2. The Earth is getting greener thanks to CO2. In turn, a greener Earth will mitigate global warming. (Source: Nature)
    3. In 2019, the carbon emissions in the US fell by 2.9%. They fell by 5% in the European Union. They also fell in Japan. (Source: Bailey)
    4. Robots can take blood samples and apparently do competent job, according to a clinical trial.
    5. We may soon get 80-terabyte disk drives.
    6. The age-adjusted cancer rate in the US is currently about at the same level as it was in 1930. We are not winning the war against cancer.
    7. You are better off cutting your food on wooden planks, they are more hygienic than plastic planks.
    8. Science is undergoing what some people call the “reproducibility crisis”: many important results reported in prestigious venues cannot be reproduced by other scientists, independently. Miyakawa suggests that the reproducibility crisis might be related to the fact that studies are frequently fabricated:

      (…) more than 97% of the 41 manuscripts did not present the raw data supporting their results when requested by an editor, suggesting a possibility that the raw data did not exist from the beginning.

      A few years ago, I was on the PhD committee of a student. I questioned the results. Ultimately, we asked for the software that produced that data. The student quickly reported that the software had been lost, deleted by the University. We declined to grant the PhD despite an extensive publication record (with articles in some of the best venues). I paid a political price for my choice to fail the student. The student eventually did get a PhD after an appeal. I would not be surprised to learn that this student became a professor. The lesson is that you should always doubt scientific studies. Ask that they be independently reproduced.

My thoughts on how research funding is attributed in Canada to Computer Science

In Canada, most computer science professors seek funding with NSERC, the main Canadian funding agency for science and engineering. It is more or less the equivalent of the American NSF. The core NSERC program is called “discovery” and it funds 5-year research programs. So, roughly speaking, currently funded professors apply for funding about once every 5 years. Once funded, you do not have to pursue the program you proposed: we recognize that you cannot be expected to stupidly stay on the same course for five years in a fast moving field.

Applicants get a rating regarding the excellence of the researcher, the merit of their proposal and on how well they are training students. It is an all-around evaluation. It is quite stressful.

Not all professors are funded or seek funding. However, it is common for computer science departments to expect their new faculty to seek funding. At many places, getting funding is a necessary condition to get tenure. I would expect it to be the case at all top departments. In effect, the NSERC discovery program acts as a Canada-wide peer review process.

The grants are modest: from about 20k$ a year to 100k$. Very few people get 100k$ a year, you basically have to be a Turing award recipient. So it is not a lot of money compared to what American professors get. In most cases, all the money goes to students. Professors cannot pay themselves. So getting a grant does not increase your salary.

In computer science, applications go to a committee made up of 40 to 50 people. Most are from Canadian universities, but there are some people from industry and from other countries. Each application is reviewed by a team of five committee members, supervised by a chair (who is also a member of the committee) as well as at least one program officer. There are also external reviews which are taken seriously, but are just one element among many. The applicants must provide samples of their research; the committee members browse and discuss these papers. And there is a 5-page proposal describing the science that the applicant wants to pursue.

I just finished a term as co-president of the committee. It is a lot of work. I could probably have written five more papers these last three years without this service responsibility. Let me add that it is unpaid.

Here are my takeaways from the experience:

  1. We often hear that science is all about publishing lots and lots of papers. That is definitively not true. Once you put a bunch of professional computer scientists in a room and you tell them to judge someone… they quickly dismiss sheer volume. They seek quality. They seek actual science. They also tend to go with proxies, as in “they published in the best venue”. Yet, even there, it is not so much the volume that matters as the fact that specific journals and conferences are especially selective. Committee members are eager for explanations as to why the research is great; it is often the applicants themselves who are not forthcoming about details. If you wrote about a breakthrough in an online note or presented it at a local workshop, your peers will be happy to consider it, though you have a bit more explaining to do than if it appeared in a prestigious venue. And it is definitively possible to get the highest ratings without a pursuit of prestigious venues.
  2. We take services to the community and to society very seriously. It is not all about papers.
  3. I don’t think bibliometrics like citations get discussed much at all: again, it is not about numbers. Being highly cited can be good, but it is not the end game.
  4. It is surprising how even the most experienced researchers sometimes cannot describe competently a research proposal. Sometimes there are goals, but no means to achieve them. Sometimes the objectives are all over the map and incoherent. People will mix and match ideas in the most absurd manner.
  5. The committee is quite open to new and bold ideas. In fact, it rewards bold ideas if they are presented in a competent manner.
  6. Year after year, I have seen “old ideas” being praised when put into a solid program. Not everything good has to be about bitcoins and deep learning. That is, it is not required that you work on fashionable ideas. The committee has a lot of tolerance for unfashionable ideas.
  7. People who try to “pad” their resume to look impressive take risks. Committee members get annoyed very quickly at people gaming the system.
  8. Everyone gets assessed on the same grid. It does not matter whether you are at the best or worst university. It does not matter if you are 20 years old or 70 years old. Evidently, it does not matter whether you are a man or not. So it is a system that is hard on younger, less experienced professors. It is also hard for people from small institutions. If you are both inexperienced and from a small institution, it is especially difficult. The counterpart is that if you do well while being at a small institution, you should be especially proud of yourself.
  9. Unfortunately, as with every year, many professors will get bad news. They will have failed to get funding. This may mean that they will soon lose their job. Some people are just bad at what they do or they have bad ideas: it is a service to tell them loud and clear. But most people who fail to receive funding are not bad. In many cases, the merit of their work was clear. This is not a perfect system, nor can it be. Some people simply have not yet had the means to reach the thresholds set by the grid. Some people do work that just does not fit well with the grid. This makes me sad and slightly depressed, but there is no easy fix. In my department, we specifically do not require that new professors get a discovery grant to receive tenure. We do our own assessment.
  10. It is definitively not all about politics. I have heard rumours about people from a given school of thought trying to sink people from another school, or people trying to advocate for their own speciality. I am sure it happens. However, when it is observed, it does get some attention. Nothing is ever perfect, but we don’t let politics take over.

All in all, I feel better about my peers and about computer science after this experience at NSERC. I am generally very concerned about quality in research. There is a lot of junk out there. Yet there are also a lot of good people trying to do good work. My expectation is that the Canadian system is probably one of the best in the world because it puts quality and good science first.


Further reading: Ebrahim Bagheri was a fellow committee member and he wrote about his experience on twitter.

Science and Technology links (February 8th 2020)

  1. It is often believed that radiation is bad for you. To the contrary, David et al. report that life expectancy is approximately 2.5 years longer in people living in areas with an elevated background radiation in the USA.
  2. Birth order, that is whether you are the oldest or youngest sibling, is independently associated with a number of health and performance metrics. The first born is usually more fit and stable. Lin et al. argue that the main explanation might be that younger siblings are more likely to have been unwanted.
  3. The University of California has completely cancelled its subscription to research papers by Elsevier. Elsevier is one of the most important scientific publishers. It is also a for-profit publisher.
  4. Low levels of Low-density lipoprotein cholesterol (LDL-C), often qualified as “bad cholesterol”, are strongly and independently associated with increased risk of cancer, cardiovascular diseases, and all-cause mortality according to a Korean study made of hundreds of thousands of human subjects. This new study puts into question mainstream beliefs about “bad cholesterol”.
  5. Education is strongly associated with better health and better longevity. However, after controlling for income and living conditions, the relationship between health and education evaporates.
  6. Harvard’s George Church has created a startup that will use an anti-aging gene therapy to increase the longevity of dogs. It is based on previous work done on mice and reported in a paper entitled A single combination gene therapy treats multiple age-related diseases (source: PNAS).
  7. As we age, 90% of us will get gray hair. It is often believed to be an irreversible process. Researchers at the University of Alabama found strong evidence that it is not the case and they believe that hair graying can be reversed. They are launching a company to develop the therapy. Note that there is documented but anecdotal evidence for gray-hair reversal, e.g., in cancer patients.