Science and Technology links (Novembre 28th 2021)

  1. Government-funded research is getting more political and less diverse:

    The frequency of documents containing highly politicized terms has been increasing consistently over the last three decades. The most politicized field is Education & Human Resources. The least are Mathematical & Physical Sciences and Computer & Information Science & Engineering, although even they are significantly more politicized than any field was in 1990. At the same time, abstracts have been becoming more similar to each other over time. Taken together, the results imply that there has been a politicization of scientific funding in the US in recent years and a decrease in the diversity of ideas supported.

  2. Parabiosis is the process of tying the blood vessels of animals. Zhang et al. proceeded with parabiosis between young and old mice, followed by a detachment period. The old mice lived longer than control mice and they appear to have been rejuvenated by the parabiosis.
  3. A single injection to enable paralyzed mice to walk again.
  4. The Artic has been warming for much longer than we thought.
  5. Dog fed only once daily are healthier.
  6. India fertility rate is now below replacement rate.

Are tenured professors more likely to speak freely?

University professors often have robust job security after a time: they receive tenure. It means that they usually do not have to worry about applying for a new job after a few years.

Tenure is not available in all countries. Countries like Australia reassess positions every few years.

So why does it exist where it does?

One of the justifications for tenure is that professors who have tenure can speak more freely. Thus, in theory, they can be critical of government or corporate policies.

Do they? What would “speaking freely” entails?

What about denouncing a colleague who commits blatant fraud? On this front, the evidence is not great. Diederik Stapel published well over 100 research papers in prestigious journals. He was fired when it was determined that he was making up all of his research data. It took outsiders (students) to report him. Harvard professor Marc Hauser published over 200 papers in the best journals, making up data as he went. It took naive students to report the fraud. We find too many examples of over fraud in science, and rarely do we find that the close colleagues, the ones who should first spot the problems, report them. Brian Wansink was another famous professor who published countless papers based on fraudulent practices. It took media pressure as well as an investigation lead by a non-academic book publisher to take him down. I could go on. Professors rarely denounce other professors.

What about teaching controversial courses or engaging in disliked research? Ceci et al. found little evidence:

The findings from the present survey suggest that tenure itself does not result in faculty members routinely teaching courses that their senior colleagues disfavor, nor in their conducting research that their senior colleagues dislike.

Whenever professors tell me that they feel free to hold controversial ideas thanks to their tenure… I ask about the positions that they took recently that would endanger their job security if not for tenure. I might also ask about the positions that they might take that would endanger their chances of getting a research grant?

I should make it clear that being an advocate for transgenders’ rights or climate change is not controversial in 2021. I am looking for examples where a lone professor goes against the majority and would otherwise loose their job.

If not themselves, then I might ask about other professors that did so. And we will find some of them. In Canada, we have Jordan Peterson at the University of Toronto, for example. Yet I do not consider Peterson particularly controversial. In fact, at the core, Peterson is merely a politically conservative professor. We used to  have a healthy share of politically conservative professors. It is only in the academy that it is highly troublesome to be politically conservative. A truly controversial professor was Denis Rancourt who thought important to question the foundation of the academy. Sadly Rancourt was fired (tenure did not protect him). I might throw in with Rancourt people like Bret Weinstein, Kathleen Stock and so forth.

So while tenure might protect professors  if they want to hold controversial ideas… professors are trained to seek prestige: they avoid serious controversy when they can. Holding controversial views puts one’s reputation in danger. The overwhelming majority will never publicly hold controversial views. They are happy to go along with whatever is the policy of the day, whatever is fashionable.

It does not follow the tenure is useless. It appears that tenured professor are often more productive. Indeed, it is possible that if you do not have to worry about where your next job will be, you can concentrate more on your teaching and your research.

More content: Gad Saad and Pat Kambhampati are controversial tenured professors in Montreal. They are not the average professor.

Converting integers to fix-digit representations quickly

It is tricky to convert integers into strings because the number of characters can vary according to the amplitude of the integer. The integer ‘1’ requires a single character whereas the integer ‘100’ requires three characters. So a solution might possible need a hard-to-predict branch.

Let us simplify the problem.

Imagine that you want to serialize integers to fixed-digit strings. Thus you may want to convert 16-digit integers (up to 10000000000000000) to exactly 16 digits, including leading zeros if needed. In this manner, it is easy to write code that contains only trivial branches.

The simplest approach could be a character-by-character routine where I use the fact that the character ‘0’ in ASCII is just 0x30 (in hexadecimal):

void to_string_backlinear(uint64_t x, char *out) {
    for(int z = 0; z < 16; z++) {
        out[15-z] = (x % 10) + 0x30;
        x /= 10;

It is somewhat strange to write the characters backward, starting from the less significant digit. You can try to go forward, but it is a bit trickier. Here is one ugly approach that is probably not efficient:

void to_string_linear(uint64_t x, char *out) {
  out[0] = x / 1000000000000000 + 0x30;
  x %= 1000000000000000;
  out[1] = x / 100000000000000 + 0x30;
  x %= 100000000000000;
  out[2] = x / 10000000000000 + 0x30;
  x %= 10000000000000;
  out[3] = x / 1000000000000 + 0x30;
  x %= 1000000000000;
  out[4] = x / 100000000000 + 0x30;
  x %= 100000000000;
  out[5] = x / 10000000000 + 0x30;
  x %= 10000000000;
  out[6] = x / 1000000000 + 0x30;
  x %= 1000000000;
  out[7] = x / 100000000 + 0x30;
  x %= 100000000;
  out[8] = x / 10000000 + 0x30;
  x %= 10000000;
  out[9] = x / 1000000 + 0x30;
  x %= 1000000;
  out[10] = x / 100000 + 0x30;
  x %= 100000;
  out[11] = x / 10000 + 0x30;
  x %= 10000;
  out[12] = x / 1000 + 0x30;
  x %= 1000;
  out[13] = x / 100 + 0x30;
  x %= 100;
  out[14] = x / 10 + 0x30;
  x %= 10;
  out[15] = x + 0x30;

Instead we could try to do it in a tree-like manner to reduce data dependency during the computation and hope for more core-level parallelism. We first divide the integer 100000000 to compute the first and last 8 digits separately, and so forth. It should drastically decrease data dependencies:

void to_string_tree(uint64_t x, char *out) {
  uint64_t top = x / 100000000;
  uint64_t bottom = x % 100000000;      
  uint64_t toptop = top / 10000;
  uint64_t topbottom = top % 10000;
  uint64_t bottomtop = bottom / 10000;
  uint64_t bottombottom = bottom % 10000;
  uint64_t toptoptop = toptop / 100;
  uint64_t toptopbottom = toptop % 100;
  uint64_t topbottomtop = topbottom / 100;
  uint64_t topbottombottom = topbottom % 100;
  uint64_t bottomtoptop = bottomtop / 100;
  uint64_t bottomtopbottom = bottomtop % 100;
  uint64_t bottombottomtop = bottombottom / 100;
  uint64_t bottombottombottom = bottombottom % 100;
  out[0] = toptoptop / 10 + 0x30;
  out[1] = toptoptop % 10 + 0x30;
  out[2] = toptopbottom / 10 + 0x30;
  out[3] = toptopbottom % 10 + 0x30;
  out[4] = topbottomtop / 10 + 0x30;
  out[5] = topbottomtop % 10 + 0x30;
  out[6] = topbottombottom / 10 + 0x30;
  out[7] = topbottombottom % 10 + 0x30;
  out[8] = bottomtoptop / 10 + 0x30;
  out[9] = bottomtoptop % 10 + 0x30;
  out[10] = bottomtopbottom / 10 + 0x30;
  out[11] = bottomtopbottom % 10 + 0x30;
  out[12] = bottombottomtop / 10 + 0x30;
  out[13] = bottombottomtop % 10 + 0x30;
  out[14] = bottombottombottom / 10 + 0x30;
  out[15] = bottombottombottom % 10 + 0x30;

We could also try to accelerate the computation with table lookups. We want to keep the tables small. We can effectively process the tail end of the tree-based technique by looking up small integers smaller than 100 by looking up their conversion: the integer 12 becomes the 2-character string ’12’ and so forth (my code could be nicer):

void to_string_tree_table(uint64_t x, char *out) {
  static const char table[200] = {
      0x30, 0x30, 0x30, 0x31, 0x30, 0x32, 0x30, 0x33, 0x30, 0x34, 0x30, 0x35,
      0x30, 0x36, 0x30, 0x37, 0x30, 0x38, 0x30, 0x39, 0x31, 0x30, 0x31, 0x31,
      0x31, 0x32, 0x31, 0x33, 0x31, 0x34, 0x31, 0x35, 0x31, 0x36, 0x31, 0x37,
      0x31, 0x38, 0x31, 0x39, 0x32, 0x30, 0x32, 0x31, 0x32, 0x32, 0x32, 0x33,
      0x32, 0x34, 0x32, 0x35, 0x32, 0x36, 0x32, 0x37, 0x32, 0x38, 0x32, 0x39,
      0x33, 0x30, 0x33, 0x31, 0x33, 0x32, 0x33, 0x33, 0x33, 0x34, 0x33, 0x35,
      0x33, 0x36, 0x33, 0x37, 0x33, 0x38, 0x33, 0x39, 0x34, 0x30, 0x34, 0x31,
      0x34, 0x32, 0x34, 0x33, 0x34, 0x34, 0x34, 0x35, 0x34, 0x36, 0x34, 0x37,
      0x34, 0x38, 0x34, 0x39, 0x35, 0x30, 0x35, 0x31, 0x35, 0x32, 0x35, 0x33,
      0x35, 0x34, 0x35, 0x35, 0x35, 0x36, 0x35, 0x37, 0x35, 0x38, 0x35, 0x39,
      0x36, 0x30, 0x36, 0x31, 0x36, 0x32, 0x36, 0x33, 0x36, 0x34, 0x36, 0x35,
      0x36, 0x36, 0x36, 0x37, 0x36, 0x38, 0x36, 0x39, 0x37, 0x30, 0x37, 0x31,
      0x37, 0x32, 0x37, 0x33, 0x37, 0x34, 0x37, 0x35, 0x37, 0x36, 0x37, 0x37,
      0x37, 0x38, 0x37, 0x39, 0x38, 0x30, 0x38, 0x31, 0x38, 0x32, 0x38, 0x33,
      0x38, 0x34, 0x38, 0x35, 0x38, 0x36, 0x38, 0x37, 0x38, 0x38, 0x38, 0x39,
      0x39, 0x30, 0x39, 0x31, 0x39, 0x32, 0x39, 0x33, 0x39, 0x34, 0x39, 0x35,
      0x39, 0x36, 0x39, 0x37, 0x39, 0x38, 0x39, 0x39,
  uint64_t top = x / 100000000;
  uint64_t bottom = x % 100000000;
  uint64_t toptop = top / 10000;
  uint64_t topbottom = top % 10000;
  uint64_t bottomtop = bottom / 10000;
  uint64_t bottombottom = bottom % 10000;
  uint64_t toptoptop = toptop / 100;
  uint64_t toptopbottom = toptop % 100;
  uint64_t topbottomtop = topbottom / 100;
  uint64_t topbottombottom = topbottom % 100;
  uint64_t bottomtoptop = bottomtop / 100;
  uint64_t bottomtopbottom = bottomtop % 100;
  uint64_t bottombottomtop = bottombottom / 100;
  uint64_t bottombottombottom = bottombottom % 100;
  memcpy(out, &table[2 * toptoptop], 2);
  memcpy(out + 2, &table[2 * toptopbottom], 2);
  memcpy(out + 4, &table[2 * topbottomtop], 2);
  memcpy(out + 6, &table[2 * topbottombottom], 2);
  memcpy(out + 8, &table[2 * bottomtoptop], 2);
  memcpy(out + 10, &table[2 * bottomtopbottom], 2);
  memcpy(out + 12, &table[2 * bottombottomtop], 2);
  memcpy(out + 14, &table[2 * bottombottombottom], 2);

You can extend this trick if you are willing to include a 40kB table in your code:

void to_string_tree_bigtable(uint64_t x, char *out) {
  #include "bigtable.h"

  uint64_t top = x / 100000000;
  uint64_t bottom = x % 100000000;
  uint64_t toptop = top / 10000;
  uint64_t topbottom = top % 10000;
  uint64_t bottomtop = bottom / 10000;
  uint64_t bottombottom = bottom % 10000;

  memcpy(out, &bigtable[4 * toptop], 4);
  memcpy(out + 4, &bigtable[4 * topbottom], 4);
  memcpy(out + 8, &bigtable[4 * bottomtop], 4);
  memcpy(out + 12, &bigtable[4 * bottombottom], 4);

An intermediate solution with a 3-character table would only require a 3kB table. I also consider Muła’s SIMD-based approach though I refer you to his article for details. Effectively Muła use fancy Intel-specific instructions to get the job done.

If you cannot use SIMD instructions, you can use something similar called SWAR. Effectively, you pack several integer values inside a wide integer (64 bits) and you try to somehow save instructions. Luckily, Khuong has a solution for us:

// credit: Paul Khuong
uint64_t encode_ten_thousands(uint64_t hi, uint64_t lo) {
  uint64_t merged = hi | (lo << 32);
  uint64_t top = ((merged * 10486ULL) >> 20) & ((0x7FULL << 32) | 0x7FULL);
  uint64_t bot = merged - 100ULL * top;
  uint64_t hundreds;
  uint64_t tens;
  hundreds = (bot << 16) + top;
  tens = (hundreds * 103ULL) >> 10;
  tens &= (0xFULL << 48) | (0xFULL << 32) | (0xFULL << 16) | 0xFULL;
  tens += (hundreds - 10ULL * tens) << 8;

  return tens;

void to_string_khuong(uint64_t x, char *out) {
  uint64_t top = x / 100000000;
  uint64_t bottom = x % 100000000;
  uint64_t first =
      0x3030303030303030 + encode_ten_thousands(top / 10000, top % 10000);
  memcpy(out, &first, sizeof(first));
  uint64_t second =
      0x3030303030303030 + encode_ten_thousands(bottom / 10000, bottom % 10000);
  memcpy(out + 8, &second, sizeof(second));

I refer you to Khuong’s blog post for a description.

I wrote a small benchmark in C++ which measures the time per integer. Remember that every call to my functions produces 16 digits, exactly.

function Apple M1, LLVM 12 AMD Zen 2, GCC 10
linear 14 ns 25 ns
backward linear 7.7 ns 18 ns
tree 6.9 ns 15 ns
Khuong 3.3 ns 8.0 ns
small table 3.7 ns 7.1 ns
SIMD non-available 4.8 ns
big table 1.5 ns 2.9 ns

On both processors, the crazy big-table (40kB) approach is about 2 times faster than the version with a small table. Though a big-table can be justified in some instances, my feeling is that only in niche applications would such a big table be acceptable for such a narrow task. Even a smaller 3kB seems like an overkill given the good results we get with a small table.

The SIMD approach has a rather minimal gain compared to the version with a small table (merely 25%).

At a glance, the small table wins on practical ground. It is small, simple and portable.


Science and Technology links (Novembre 13rd 2021)

  1. Pacific rougheye rockfish can live hundreds of years while other rockfish barely live past ten years.
  2. Female condors can reproduce without males. The phenomenon is known as parthenogenesis and it occurs in birds such as chickens. It does not happen in mammals naturally as far as we know.
  3. Chimpanzees can thrive in virtual reality.
  4. People have a good opinion of their own environmental impact and they are biased against the behaviour of others. In other words, we are environmental hypocrites.
  5. A majority of minimum wage workers are age 15 to 24 in Canada.
  6. Though we tend to view American businesses as uniquely innovative, Nordic countries in Europe may be just as innovative, albeit with a small population. This could be explained in part by the fact that Nordic countries have advantageous tax regimes for businesses.
  7. Smartphone vendors like to scan our content for illegal pictures. From databases of pictures, they seek to detect pictures that you own that might be visually similar. Unfortunately, it appears that the current breed of algorithms are easily fooled.
  8. Eating several eggs a day might improve your cardiovascular health.
  9. Some mice can regenerate their kidneys without scars.
  10. About half of all patents have to do with software. They are filed from a few major tech centres, unlike other patents which have more varied origins.

Checking simple equations or inequalities with z3

When programming, you sometimes need to make sure that a given formula is correct. Of course, you can rely on your mastery of high-school mathematics, but human beings, in general, are terrible at formal mathematics.

Thankfully, you can outsource simple problems to a software library. If you are a Python user, you can install z3 relatively with pip: pip install z3-solver. You are then good to go!

Suppose that you want to be sure that ( 1 + y ) / 2 < y for all 32-bit integers y. You can check it with the following Python script:

import z3
y = z3.BitVec("y", 32)
s = z3.Solver()
s.add( ( 1 + y ) / 2 >= y )
if(s.check() == z3.sat):
    model = s.model()

We construct a “bit vector” spanning 32 bits to represent our integer. The z3 library considers such a number by default as a signed integer going from -2147483648 to 2147483647.

We check the reverse inequality, because we are looking for a counterexample. When no counterexample can be found, we know that our inequality holds.

Running the above script, Python prints back the integer 2863038463. How can this be given that we said that z3 assumed that we were using numbers from -2147483648 to 2147483647?  The z3 library always prints out a positive integer, and it is our job to reinterpret it: the number 2147483647 is fine, the number 2147483648 becomes -2147483648, the number 2147483649 becomes -2147483647 and so forth. This representation which “wraps around” is called two’s complement. So 2863038463 is actually interpreted as a negative value. The exact value is not too important: what matters is that our inequality (( 1 + y ) / 2 < y) is false when y is negative. It is clear enough: setting the variable to -1, I get 0 < -1.  For zero, the inequality is also false. Let me add as a condition that the variable is positive:

import z3
y = z3.BitVec("y", 32)
s = z3.Solver()
s.add( ( 1 + y ) / 2 >= y )
s.add( y > 0 )

if(s.check() == z3.sat):
    model = s.model()

This time, I get 1 as a possible solution. Damn it! Let us exclude it as well:

import z3
y = z3.BitVec("y", 32)
s = z3.Solver()
s.add( ( 1 + y ) / 2 >= y )
s.add( y > 1 )

if(s.check() == z3.sat):
    model = s.model()

Given this last script nothing is printed out which proves that my inequality is valid, as long as the variable is greater than 1.

Given that we have showed that ( 1 + y ) / 2 < y, then maybe ( 1 + y )  < 2 * y is true as well?

import z3
y = z3.BitVec("y", 32)
s = z3.Solver()
s.add( ( 1 + y ) >= 2 * y )
s.add( y > 1 )

if(s.check() == z3.sat):
    model = s.model()

This script returns 1412098654. Twice that value is 2824197308 which is interpreted as a negative number using two’s complement. To avoid the overflow, we can put another condition on the variable:

import z3
y = z3.BitVec("y", 32)
s = z3.Solver()
s.add( ( 1 + y ) / 2 >= y )
s.add( y > 0 )
s.add( y <  2147483647/2)

if(s.check() == z3.sat):
    model = s.model()

This time the result holds!

As you can see, it is quite a lot of work and it unlikely that anyone would be able to fully test out a non-trivial program in this manner.

Stop spending so much time being trolled by billionaire corporations!

As a kid, my parents would open the television set, and we would get to watch whatever the state television decided we would watch. It was a push model. Some experts pick the content you need and they deliver it to you. You have little say in the matter. There was one newspaper in my town. We had two or three TV channels. Traditional schools also operate on a push model: the teacher decides whatever you get to learn.

So I got to watch hours and hours of incredibly boring TV shows because there was nothing good on. I was very interested in computers and science, but there was almost nothing relevant in the major news sources. When personal computers became popular, I quickly learned more about them than any journalist had.

In the early days of the Internet, people wrote on posting boards. Some started blogs. To this day, I get much of my online news by an RSS aggregator which collects information from various blogs and news sites. An RSS aggregator simply picks up all of the news items from various sites, and it lays it out sequentially. You do not like a news source? You unsubscribe. Mailing lists work similarly: you get emails whenever someone has new content.

This model has been described as “pull” oriented. You pick your sources. You sidestep the experts. For someone like myself, it was incredibly liberating. As the pull model grew, many people feared that old-school journalism would die. It also challenged conventional education.

Within this emerging framework, Silicon Valley invented Twitter, Facebook and other networks. At first they worked much like posting boards and blogs. You launched Twitter and you got the tweets of the people you followed. You did not like someone’s tweets? You just unfollowed them. Google even co-opted the RSS reader by creating a fantastic tool called Google Reader, which put you in control.

However, mostly, the industry moved in a different direction. They took control of what you see and read. Brilliant engineers are hard at work making sure that you remain glued to your screen. So they find content that you may like and push it to you. Google closed Google Reader, decimating the RSS reader community. Whereas you could count on the Google search engine delivering the documents containing the keywords you are search, you are increasingly facing a curated list of links.

We are back at a push model which is not unlike how things were when I was a kid. The likes of Twitter, Facebook and Google feel like they get to decide what I see.

Richard Startin describes my feeling in a couple of tweets:

As the user, you are no longer in control. The corporation is wrestling back full control of what you get to watch and read.

TikTok is one such tool where you just open the application and watch whatever they want you to watch. You become some kind of automaton.

Of course, deciding what people watch and read is valuable. It is a great way to make money. But it also becomes a politically valuable power. And so, it now seems unavoidable that multiple countries in the world will regulate these sites to make sure that you watch the right “things”.

Maybe Richard Startin wants to read about what programmers have to say, but what if some corporation or some government feels that he needs to be made aware of some important bit of information, what then ?

Thankfully some tools are still leaving you in control:

    1. Blogs are still out there. I had 50,000 visitors last month. You can still use RSS readers. You can subscribe to blogs like mine by email.
    2. I have recently discovered the fantastic substack community.
    3. Telegram is pretty decent as a secured news aggregator. My blog has a telegram channel. Nobody needs to know what you are reading.
    4. Twitter has a hidden feature (twitter list) which lets you subscribe to specific individuals and only see content from these individuals.
    5. DuckDuckGo is a fantastic search engine which mostly gives me what I am looking for instead of what it thinks I should find.
    6. Do not underestimate books. Contrary to what you may have heard, you can still order paper books. The great thing about a paper book is that nobody needs to know what you are reading and when. If you like books and programming, you can grab Performance Analysis and Tuning on Modern CPUs for example. I have many recommendations on the sidebar of my blog.
    7. There are fantastic podcasts out there. Spotify has some great stuff, but you can find many others on other platforms. If you like programming, you might want to check corecursive. Joe Rogan is also fantastic. There are many others.

Being in control takes work. It also requires you to sometimes pay real money. But do you really want to have your attention is sold and manipulated?

I am not advocating that anyone leaves Twitter, Facebook or tiktok, but we should all diversify our information sources. Be conscious that Twitter, Facebook, Google and others are in the business of manipulating your attention for their interest and the interests of their clients.

Be smarter!

Related: The Social Dilemma.

Science and Technology (October 31st 2021)

  1. Though exoskeletons are exciting and they allow some of us to carry one with physical activities despite handicaps, they appear to require quite a bit of brain power. In effect, though they may help you move, they require a lot of mental effort which can be distracting.
  2. It is difficult to make people smarter by changing their environment. Socioeconomic background effects on children’s cognitive development and student achievement are likely to be spurious according to new research.
  3. There is a U-shaped association between cardiovascular health and physical activity. A moderate amount of physical activity is really good for you, but you can do too much.
  4. In richer countries where women are treated as equals to men, there are greater differences between boys’ and girls’ aspirations. For example, boys are more likely to be attracted to science compared to girls in a rich country.
  5. Scientists spend ever more time in school for relatively few desirable positions. These scientists are then ever more likely to pursue post-doctoral positions. These longer studies are a significant net loss of their lifetime income.

In C, how do you know if the dynamic allocation succeeded?

In the C programming language, we allocate memory dynamically (on the heap) using the malloc function. You pass malloc a size parameter corresponding to the number of bytes you need. The function returns either a pointer to the allocated memory or the NULL pointer if the memory could not be allocated.

Or so you may think. Let us write a program that allocates 1 terabytes of memory and then tries to write to this newly allocated memory:

#include <stdio.h>
#include <stdlib.h>

int main() {
  size_t large = 1099511627776;
  char *buffer = (char *)malloc(large);
  if (buffer == NULL) {
    return EXIT_FAILURE;
  printf("Memory allocated\n");
  for (size_t i = 0; i < large; i += 4096) {
    buffer[i] = 0;
  return EXIT_SUCCESS;

After running and compiling this program, you would expect to either get the message “error!”, in which case the program terminates immediately… or else you might expect to see “Memory allocated” (if 1 terabyte of memory is available) in which case the program should terminate successfully.

Under both macOS/clang and Linux/GCC, I find that the program prints “Memory allocated” and then crashes.

What is happening?

The malloc call does allocate memory, but it will almost surely allocate “virtual memory”. At the margin, it could be that no physical memory is allocated at all. The system merely sets aside the address space for the memory allocation. It is when you try to use the memory that the physical allocation happens.

It may then fail.

It means that checking the memory allocation was a success is probably much less useful than you might have imagined when calling malloc.

It also leads to confusion when people ask how much memory a program uses. It is wrong to add up the calls to malloc because you get the virtual memory usage. If you use a system-level tool that reports the memory usage of your processes, you should look at the real memory usage. Here is the current memory usage of a couple of processes on my laptop currently:

process memory (virtual) memory (real)
qemu 3.94 GB 32 MB
safari 3.7 GB 180 MB

Dynamic memory is allocated in pages (e.g., 4 kB) and it is often much smaller than the virtual memory. Doing “malloc(x)” is not the same as taking x bytes of physical memory.

In general, it is therefore a difficult question to know whether the allocation succeeded when relying on malloc. You may only find out when you write and read to the newly allocated memory.

In C++, is empty() faster than comparing the size with zero?

Most C++ programmers rely on “STL” for their data structures. The most popular data structure is probably vector, which is just a dynamic array. The set and the map are other useful ones.

The STL data structures are a minimalist design. You have relatively few methods. All of them allow you to compute the size of the data structure, that is, how many elements it contains, via the size() method. In recent C++ (C++11), the size() method must have constant-time complexity for all containers. To put it in clearer terms, the people implementing the containers can never scan the content to find out the number of elements.

These containers also have another method called empty() which simply returns true of the container is… well… empty. Obviously, an equivalent strategy would be to compare the size with zero:  mystruct.size() == 0.

Obviously, determining whether a data structure is empty is conceptually easier than determining its size. Thus, at least in theory, calling empty() could be faster.

Inspecting the assembly output, I find that recent versions of GCC produce nearly identical code for the comparison of the size and the empty call. The exception being the list data structure where the assembly is slightly different, but not in a manner that should affect performance.

Of course, there are different implementations of C++ and it is possible that other implementations could provide more efficient code when calling empty(). An interesting question is whether effort is needed from the compiler.

Travis Downs wrote a list data structure by hand, but with a size() function that is linear time. He then implemented the empty function naively:

struct node {
    struct node *next;
    int payload;

int count_nodes(const node* p) {
    int size = 0;
    while (p) {
        p = p->next;
    return size;

bool is_not_empty(const node* p) {
    return count_nodes(p) > 0;

Amazingly, we find that the GCC compilers is able to compile Travis’ is_not_empty C++ function to constant-time code. The compiler inlines the count_nodes function into is_empty. Then the compiler figures out that as soon as you enter the loop once with count_nodes, then size is going to be greater than zero, so there is no need to keep looping.

However, the optimisation is not robust. Suppose that I wish instead to return an unsigned type instead of Travis’ int value. The problem with an unsigned type is that I might overflow if the list is very long. With a signed integer, the compiler is allowed to assume that overflows do not happen. It could be difficult for the compiler to tell whether count_nodes() return 0 or not, if the compiler must handle overflows. To handle this potential issue, I can forcefully bound the return value of count_nodes() to be no more than 1000. If I change the code to return a standard size_t type, like so…

#include <cstddef>

struct node {
    struct node *next;
    int payload;

size_t count_nodes(const node* p) {
    size_t size = 0;
    while (p) {
        p = p->next;
        if(size == 1000) { 
            return 1000; 
    return size;

bool is_not_empty(const node* p) {
    return count_nodes(p) > 0;

Sadly, GCC is now unable to optimize away the call. Maybe compilers are not yet all powerful beings?

The lesson is that it is probably wise to get in the habit of calling directly empty() if you care about performance. Though it may not help much with modern STL data structures, in other code it could be different.

Of course, another argument is that the call to empty()  is shorter and cleaner.

Credit: This blog post was motivated by a tweet by Richard Startin.

Science and Technology links (October 23rd 2021)

    1. Apple announced new processors for its computers. Here is a table with the transistor count of some recent Apple processors:
      processor release year transistors
      Apple A7 2013 1 billions
      Apple A8 2014 2 billions
      Apple A9 2015 2 billions
      Apple A10 2016 3.2 billions
      Apple A11 2017 4.3 billions
      Apple A12 2018 6.9 billions
      Apple A13 2019 8.5 billions
      Apple A14 2020 11.8 billions
      Apple M1 2020 16 billions
      Apple M1 Max 2021 57 billions

      We find that the number of transistors on commodity processors doubles every two years, approximatively. Furthermore, the energy usage of chips is trending downward: the Apple M1 Max reportedly uses less than 100 Watts, which is comparable with processors of the last two decades despite a much greater performance.
      The storage capacity and bandwidth is also trending upward: the Apple M1 Max has 64GB of RAM with a bandwidth of 400GB/s.

      However, our processors are increasingly parallel in nature: the M1 Max has ten generic processing cores and 32 graphical cores. You are unlikely to be able to make use of the 64 GB of RAM and of all of its bandwidth with a single core.

    2. It seems that older people might be more social, more likely to give to charity. It seems to confirm earlier work.
    3. A popular supplement (nicotinamide riboside) seems to partially rejuvenate the immune system of old mice. (I do not take nicotinamide riboside.)
    4. Winter is coming. Maybe you are planning to hide from the cold? It appears that cold exposure reduces neuroinflammation. Since too much inflammation is commonly a problem when you are not otherwise sick, these results suggest that going outside in the cold might do some good.
    5. Weaker students in the US have poorer and poorer scores in reading and mathematics. The best students are still doing well. The net result is that the gap between the weak students and the best students is getting larger. (These results are unrelated to the recent pandemic.)

Converting binary floating-point numbers to integers

You are given a floating-point number, e.g. a double type in Java or C++. You would like to convert it to an integer type… but only if the conversion is exact. In other words, you want to convert the floating-point number to an integer and check if the result is exact.

In C++, you could implement such a function using few lines of code:

bool to_int64_simple(double x, int64_t *out) {
  int64_t tmp = int64_t(x);
  *out = tmp;
  return tmp == x;

The code is short in C++, and it will compile to few lines of assembly. In x64, you might get the following:

        cvttsd2si rax, xmm0
        pxor    xmm1, xmm1
        mov     edx, 0
        cvtsi2sd        xmm1, rax
        mov     QWORD PTR [rdi], rax
        ucomisd xmm1, xmm0
        setnp   al
        cmovne  eax, edx

Technically, the code might rely on undefined behaviour.

You could do it in a different manner. Instead of working with high-level instructions, you could copy your binary floating-point number to a 64-bit word and use your knowledge of the IEEE binary64 standard to extract the mantissa and the exponent.

It is much more code. It also involves pesky branches. I came up with the following routine:

typedef union {
  struct {
    uint64_t fraction : 52;
    uint32_t exp_bias : 11;
    uint32_t sign : 1;
  } fields;
  double value;
} binary64_float;

bool to_int64(double x, int64_t *out) {
  binary64_float number = {.value = x};
  const int shift = number.fields.exp_bias - 1023 - 52;
  if (shift > 0) {
    return false;
  if (shift <= -64) {
    if (x == 0) {
      *out = 0;
      return true;
    return false;
  uint64_t integer = number.fields.fraction | 0x10000000000000;
  if (integer << (64 + shift)) {
    return false;
  integer >>= -shift;
  *out = (number.fields.sign) ? -integer : integer;
  return true;

How does it compare with the simple and naive routine?

I just wrote a simple benchmark where I iterate over many floating-point numbers in sequence, and I try to do the conversion. I effectively measure the throughput and report the number of nanosecond per function call. My benchmark does not stress the branch predictor. Unsurprisingly, the naive and simple approach can be faster.

system long version simple version
ARM M1 + LLVM 12 0.95 ns/call 1.1 ns/call
Intel 7700K + LLVM 13 1.6 ns/call 1.2 ns/call

I find reassuring that the simple approach is so fast. I was concerned at first that it would be burdensome.

It is possible that my long-form routine could be improved. However, it seems unlikely that it could ever be much more efficient than the simple version.

My source code is available.

Science and Technology links (October 16th 2021)

  1. The thymus is an important component of our immune system. As we age, the thymus degenerates and our immune system becomes less fit: emotional and physical distress, malnutrition, and opportunistic bacterial and viral infections damage the thymus. New research suggests that practical thymus regeneration could be closer than ever. The thymus is able to repair itself and the right mix of signals could convince it to stay fit over time.
  2. It appears that women use sexual vocalization during penetration to increase their sexual attractiveness to their current partner by means of boosting their partner’s self-esteem.
  3. People associate creativity with effortless insight and undervalue persistence. If you just sit down and work hard, you can generate many new ideas. If you just wait for ideas to come, you will generate far fewer ideas.
  4. 63% of faculty perceive that their university does not care about the content of their publications, only where and how much is published. Such a culture can easily lead to people producing work that passes peer review but fails to contribute meaningfully to science or engineering.
  5. Redheaded women are more sexually active than other women because men are more likely to approach them with sexual intentions.

Calling a dynamically compiled function from Go

Compiled programming languages are typically much faster than interpreted programming language. Indeed, the compilation step produces “machine code” that is ideally suited for the processor. However, most programming languages today do not allow you to change the code you compiled. It means that if you find out which function you need after the code has been compiled, you have a problem.

It happens all of the time. For example, you might have a database engine that has to do some expensive processing. The expensive work might be based on a query that is only provided long after the code has been compiled and started. It means that your performance could be limited because the code that runs the query was written without sufficient knowledge of the query.

Let us illustrate the issue and a possible solution in Go (the programming language).

Suppose that your program needs to take the power of some integer. You might write a generic function with a loop, such as the following:

func Power(x int, n int) int {
	product := 1
	for i := 0; i < n; i++ {
		product *= x
	return product

However, if the power (n) is a constant, such a function is inefficient. If the constant is known at compile time, you can write an optimized function:

func Square(x int) int {
	return x * x

But what if the power is only given to you at runtime?

In Go, we are going to write out the following file (on disk):

package main

func FastFunction(x int) int {
	return x * x

The file can be created, at runtime, from with a runtime variable ‘n’ indicating the power:

	file, _ := os.Create("fast.go")
	file.WriteString("package main\nfunc FastFunction(x int) int {\n  return x")
	for i := 1; i < n; i++ {
		file.WriteString(" * x")

Importantly, the variable ‘n’ can be any integer, including an integer provided by the user, at runtime.
We then compile the new file and load the new function as a “plugin”:

	exec.Command("go", "build", "-buildmode=plugin", "-o", "", "fast.go").Output() 
        plug, _ := plugin.Open("")
	fastSquare, _ := plug.Lookup("FastFunction")
	loaded, _ := fastSquare.(func(int) int)

And that is all! The ‘loaded’ variable contains a fast and newly compiled function that I can use in the rest of my code.

Of course, writing the file on disk, compiling the result and loading the compiled function is not free. On my laptop, it takes about half a second. Yet once the function is loaded, it is appears to be as fast as the native function (see the Square function above).

I expect that Go cannot “inline” the resulting function inside the rest of your code. But that should not be a concern if your function is non-trivial.

Another downside of the approach I described is that it is fragile. It can fail for many reasons. It is a security risk. Thus you should only do it if it is absolutely necessary. If you are motivated by performance, it should offer a massive boost (e.g., 2x the performance).

Furthermore, once a plugin is loaded, it cannot be unloaded. If you were to use this method again and again, you might eventually run out of memory.

My source code is available.

Science and Technology links (October 10th 2021)

  1. Evans and Chu suggest, using data and a theoretical model, that as the number of scientists grow, progress may stagnate. Simply put, in a large field, with many researchers, a few papers and a few people are able to acquire a decisive advantage over newcomers. Large fields allow more inequality. One can review their model critically. Would you not expect large fields to fragment into smaller fields? And haven’t we seen much progress in fields that have exploded in size?
  2. Keeping your iron stores low might be important to slow your aging. Sadly, much of what you eat has been supplemented with iron because a small fraction of the population needs iron supplementation. It is widely belief that you cannot have too much iron supplementation, but, to my knowledge, long-term effects of iron supplementation have not been carefully assessed.
  3. If you lose weight while having high levels of insulin, you are more likely to lose lean tissues (e.g., muscle) than fat.
  4. A drug similar to viagra helps mice fight obesity.
  5. Age-related hair loss might be due to stem cells escaping the hair follicle bulge. This new work contradicts the commonly held belief that the stem cells die over time. This work may not relate to male-pattern baldness.
  6. People tend to stress the ability of “formal peer review” to set apart the good work from the less significant work. People greatly overestimate the value of formal peer review. They forget that much of the greatest works of science occured before formal peer review had even been considered. Cortes and Lawrence have assessed the quality of peer review at a (or “the”) major conference these days (NeurIPS). They found several years ago that when two teams of referees independently assessed the same work, they only agreed on about 50% of the assessment. They have extended their work with a new finding:

    Further, with seven years passing since the experiment we find that for accepted papers, there is no correlation between quality scores and impact of the paper as measured as a function of citation count.

    The lesson is pretty clear. Formal peer review can identify and reject the “obviously bad work”. But it is not a difficult task. In fact, I am quite certain that a fully automated system could quickly identify work that nobody should ever read. However, if you have work that has been visibly well executed, following the state-of-the-art, citing the required prior work, using the right data for the problem, and so forth… then it becomes really difficult to tell whether it is great work, or simply work that meets minimal standards. It is obvious if you know how formal peer review works. You have between two and five researchers, with various levels of familiarity with the domain of the paper, who read it over. They do not redo the experiments. They do not redo the demonstrations. Sometimes you get lucky and find a referee that has deep knowledge of the domain (often because they have worked on a similar problem) and they can be really critical. Even so, in the context of a conference peer review process, there is only so much a highly knowledgeable referee can do, since they cannot freely interact with the authors.

For software performance, can you always trust inlining?

It is easier for an optimizing compiler to spot and eliminate redundant operations if it can operate over a large block of code. Nevertheless, it is still recommended to stick with small functions. Small functions are easier to read and debug. Furthermore, you can often rely on the compiler smartly inline your small functions inside larger functions. Furthermore, if you have a just-in-time compiler (e.g., in C# and Java), the compiler may often focus its energy on functions that are called more often. Thus small functions are more likely to get optimized.

Nevertheless, you sometimes want to manually inline your code. Even the smartest compilers get it wrong. Furthermore, some optimizing compilers are simply less aggressive than others.

We have been working on a fast float-parsing library in C# called csFastFloat. Though it is already several times faster than the standard library, the primary author (Verret) wanted to boost the performance further by using SIMD instructions. SIMD instructions are fancy instructions that allow you to process multiple words at once, unlike regular instructions. The C# language allows you to use SIMD instructions to speed up your code.

He encountered a case where using SIMD instructions failed to bring the exciting new performance he was expecting. The easy way out in the end was to “manually inline” the functions. I am going to review a simplified instance because I believe it is worth discussing.

I am not going to review it in detail, but the following C# function allows you to quickly determine if 16 bytes are made of ASCII digits. In ASCII, the character ‘0’ has value 48 whereas the character ‘9’ has value 57. The function builds two “vectors” made of the values 47 and 58 (repeated 16 times) and then it does 16 comparisons at once (with CompareGreaterThan and then CompareLessThan). It concludes with a couple of instructions to check whether all conditions are met to have 16 digits.

unsafe static bool is_made_of_sixteen_digits(byte* chars) {
  Vector128<sbyte> ascii0 = Vector128.Create((sbyte)47);
  Vector128<sbyte> after_ascii9 = Vector128.Create((sbyte)58);
  Vector128<sbyte> raw = Sse41.LoadDquVector128((sbyte*)chars);
  var a = Sse2.CompareGreaterThan(raw, ascii0);
  var b = Sse2.CompareLessThan(raw, after_ascii9);
  var c = Sse2.Subtract(a, b); // this is not optimal   
  return (Sse41.TestZ(c,c));

Let us write a function that tries to determine if I have a string that begins with 16 digits, or 32 digits. It is a made up function that I just use for illustration purposes.

// return 2 if 32 digits are found
// return 1 if 16 digits are found
// otherwise return 1
unsafe static int ParseNumberString(byte* p, byte* pend) {
  if ((p + 16 <= pend) && is_made_of_sixteen_digits(p)) {
    if((p + 32 <= pend) && is_made_of_sixteen_digits(p + 16)) {
      return 2;
    return 1;
  return 0;

If you just write that out, C# might fail to inline the is_made_of_sixteen_digits function. Thankfully, you can tell C# that you do want it to inline the child function by labelling it with an “AggressiveInlining” attribute.

So far so good, right?

Let us look at the assembly output to see how it gets compiled. Do not worry, I do not expect you to read this gibberish. Just look at the lines that start with an arrow (“->”).

<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: vzeroupper
    L0006: lea eax, [ecx+0x10]
    L0009: cmp eax, edx
    L000b: ja short L0071
    -> L000d: vmovupd xmm0, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)]
    -> L0015: vmovupd xmm1, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)]
    L001d: vlddqu xmm2, [ecx]
    L0021: vpcmpgtb xmm0, xmm2, xmm0
    L0025: vpcmpgtb xmm1, xmm1, xmm2
    L0029: vpsubb xmm0, xmm0, xmm1
    L002d: vptest xmm0, xmm0
    L0032: jne short L0071
    L0034: lea eax, [ecx+0x20]
    L0037: cmp eax, edx
    L0039: ja short L006a
    -> L003b: vmovupd xmm0, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)]
    -> L0043: vmovupd xmm1, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)]
    L004b: vlddqu xmm2, [ecx+0x10]
    L0050: vpcmpgtb xmm0, xmm2, xmm0
    L0054: vpcmpgtb xmm1, xmm1, xmm2
    L0058: vpsubb xmm0, xmm0, xmm1
    L005c: vptest xmm0, xmm0
    L0061: jne short L006a
    L0063: mov eax, 2
    L0068: pop ebp
    L0069: ret
    L006a: mov eax, 1
    L006f: pop ebp
    L0070: ret
    L0071: xor eax, eax
    L0073: pop ebp
    L0074: ret

What C# does is to create the “vectors” made of the values 47 and 58 twice each (for a total of four times).

There might be a clever way to get C# to stop being inefficient, but you can also just manually inline. That is you create one function that includes the other function. The result might look at follows:

// return 2 if 32 digits are found
// return 1 if 16 digits are found
// otherwise return 1
unsafe static int ParseNumberStringInline(byte* p, byte* pend) {
  if (p + 16 <= pend) {
    Vector128<sbyte> ascii0 = Vector128.Create((sbyte)47);
    Vector128<sbyte> after_ascii9 = Vector128.Create((sbyte)58);    
    Vector128<sbyte> raw = Sse41.LoadDquVector128((sbyte*)p);
    var a = Sse2.CompareGreaterThan(raw, ascii0);
    var b = Sse2.CompareLessThan(raw, after_ascii9);
    var c = Sse2.Subtract(a, b);
    if((p + 32 <= pend) && Sse41.TestZ(c,c)){
      raw = Sse41.LoadDquVector128((sbyte*)p + 16);
      a = Sse2.CompareGreaterThan(raw, ascii0);
      b = Sse2.CompareLessThan(raw, after_ascii9);
      c = Sse2.Subtract(a, b);
      if(Sse41.TestZ(c,c)) { return 2; }
    return 1;
  return 0;

This new code is harder to read and maybe harder to maintain. However, let us look at the compiled output:

<Program>$.<<Main>$>g__ParseNumberStringInline|0_2(Byte*, Byte*)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: vzeroupper
    L0006: lea eax, [ecx+0x10]
    L0009: cmp eax, edx
    L000b: ja short L0061
    L000d: vmovupd xmm0, [<Program>$.<<Main>$>g__ParseNumberStringInline|0_2(Byte*, Byte*)]
    L0015: vmovupd xmm1, [<Program>$.<<Main>$>g__ParseNumberStringInline|0_2(Byte*, Byte*)]
    L001d: vlddqu xmm2, [ecx]
    L0021: vpcmpgtb xmm3, xmm2, xmm0
    L0025: vpcmpgtb xmm2, xmm1, xmm2
    L0029: vpsubb xmm4, xmm3, xmm2
    L002d: lea eax, [ecx+0x20]
    L0030: cmp eax, edx
    L0032: ja short L005a
    L0034: vptest xmm4, xmm4
    L0039: jne short L005a
    L003b: vlddqu xmm2, [ecx+0x10]
    L0040: vpcmpgtb xmm3, xmm2, xmm0
    L0044: vpcmpgtb xmm2, xmm1, xmm2
    L0048: vpsubb xmm4, xmm3, xmm2
    L004c: vptest xmm4, xmm4
    L0051: jne short L005a
    L0053: mov eax, 2
    L0058: pop ebp
    L0059: ret
    L005a: mov eax, 1
    L005f: pop ebp
    L0060: ret
    L0061: xor eax, eax
    L0063: pop ebp
    L0064: ret

I do not expect you to read this gibberish, but notice how the result is now tighter. It is also going to be slightly faster.

As a rule of thumb, if you look at the assembly output of your code, and it is shorter, it is usually the code that you will have better performance. It is simply the case that executing fewer instructions is often faster.

Though my example is a toy example, you should expect the csFastFloat library to benefit from SIMD instructions in the near future. The preliminary numbers I have seen suggest a nice performance bump. There is a pull request.

My source code is available.

Working in virtual reality

Inspired by a post by Paul Tomlinson, I wrote my last blog post entirely in virtual reality (VR). You put on goggles and see a virtual version of your computer screen wherever you would like. Otherwise, everything around you can be anything you would like. Right now, I am floating in space, I can see the stars all around me.

Why would you ever want to work in this manner? One of the most obvious benefits is immersion. You can setup a working environment that is ideally suited for your work, with all the right screens and windows. You can be free from distractions. There may also be accessibility benefits: it gives you more freedom in how your physical body is set.

I had explored the idea of a virtual office before, but the screen resolution of my early HTC Vive made reading computer text just too painful. The hardware is better today.

I must admit that the claim that it could make your more productive is speculative at this point. However, progress depends on us being unreasonable. You need to try out crazy ideas if you hope for progress.

I had my wife take a picture:

I used an old Oculus Quest. It was my second experience. The second last blog post was written partially in VR, but my Oculus Quest ran out of power before I could complete the experience.

Like Tomlinson, I used Immersed VR as the software. I do not have any financial interest in either Immersed VR or Oculus. Here are my thoughts:

  1. It works. The hardware is really fantastic, as always, even though it is an old Quest (1.0). The software works. I had minor issues, like the application somehow becoming irresponsive, but I could easily be productive all day long.
  2. Maybe because I need to wear glasses, the experience is not as great as it could be, I feel mild pressure on my nose, but it is reportedly possible to buy vision-adjusted glasses. I am going to have to investigate.
  3. I was expecting that reading text in VR would prove difficult, but it is not. Importantly,  because you can put as many gigantic windows as you want around you, having a slightly lesser-than-perfect resolution is ok.
  4. The main difficulty is proving to be that I cannot see my keyboard. I rely on my peripheral vision to locate my fingers in relation with the keyboard. It is fine once I start typing and that I do not need to press uncommon keys. But if I need to do move my hands outside their usual typing position, I lose track of their exact position of the keyboard. I end up frequently hitting the wrong key. As you can see in the picture, I am using a wireless keyboard and a wireless mice. Working directly on the laptop proved too annoying because it has a flat surface that makes it hard for me to locate the keyboard. At least, with a standalone keyboard, I can feel the exact location of the keyboard. I suspect that changing the keyboard type could help. I tried with a gaming (mechanical) keyboard and it seemed to work better. There is a way to make a virtual version of your keyboard appear in virtual reality but, so far, it did not help me enough. For now, I just put some stickers on the keyboard to help my fingers, but it only helps so much.
  5. Whenever I take off the Quest and put it back, my windows have moved. Since it also moves the virtual keyboard, it makes the virtual keyboard much less interesting. I ended up not using it. I try hard to avoid having to take off the goggles while I work.
  6. Power usage is a concern. The Quest is standalone but it has a small battery and limited autonomy. So I have a small power cable connected to it.
  7. The Quest takes a few seconds to boot up. The application (Immersed VR) also takes a few seconds. You cannot just shut the Quest down and resume from your last work session.
  8. Many of us work in videoconference. Clearly, nobody wants to watch me wearing  a VR headset. There is reportedly a virtual webcam. I have yet to try it.

Science and Technology links (October 3rd 2021)

  1. Most people were able to cure their diabetes by losing weight in a clinical trial.
  2. Video games improve intelligence over many years, while socializing has no effect.
  3. To go to Mars safely, time is of the essence because astronauts would be exposed to radiations and particles from outside of solar system.
  4. It is unlikely that there is an upper bound to human longevity under 130 years. Though men have a drastically reduced longevity compared to women, it appears that once you reach 108 years, there may not be any gender difference.
  5. Recent progress having to do with protein folding can be largely attributed to better hardware and more data, and not really to a better understanding of protein folding. The same could be said for natural language processing. (Speculative)
  6. The children of the survivors of Chernobyl do not have excess mutations. It is consistent with a worldview that radiations are less harmful to our biology than commonly assumed.
  7. Researchers have built flexible glass that is much stronger and flexible. It does not shatter easily.

Word-aligned Bloom filters

Programmers often need to ‘filter out’ data. Suppose that you are given a database of users where only a small percentage are ‘paying customers’ (say 5% or less). You can write an SQL query to check whether a given user is indeed a paying customer, but it might require a round trip to your database engine. Instead, you would like to hold a small ‘filter’ in memory to quickly check whether the given user is a paying customer. When you are pretty sure it might be a paying customer, then, and only then, do you query the database.

Probabilistic filters are ideally suited for such tasks. They return “true” when the key might be found, and “false” when the key definitively cannot be found. A probabilistic filter can be fast and concise.

The conventional probabilistic filter is the Bloom filter. We construct a Bloom filter as follows. We start with an array of bits. Initially, all of the bits are set to 0. When a new value is added to the filter, we map it to several “random” locations in the array of bit. All of the bits at the matching locations are set to 1. The random mapping is done using “hash functions”. A hash function is a function that returns “random-looking” values: it is nevertheless a genuine function. That is, a given hash function always return the same output given the same input.

When querying a Bloom filter, we hash the received value to several locations in the array of bits. If all of the corresponding bits are set to 1, then we have that value might be in the corresponding set. If only just one of these bit values is 0, then we know that the value was not previous added to the set.

A Bloom filter can be remarkably efficient. But can you do better? There are many other probabilistic filters, but let us say that you want to remain close to Bloom filters.

One way to improve on the Bloom filter is to avoid mapping your values all over the array of bits. Instead, you might first map the value to a small block of bits, and then check the bits within a limited range. It can make processing much faster because you do not need multiple data loads in an array. This approach is known under the name of “blocked Bloom filter”. It is much less popular than conventional Bloom filters. One reason why it might be less popular is that blocked Bloom filters require more storage.

I suspect that another reason people might prefer Bloom filters to alternatives such as blocked Bloom filters has to do with implementation. Bloom filters are mature and it is easy to pick up a library. E.g., there is a mature Bloom filter library in Go.

What is the simplest blocked Bloom filter you could do? Instead of doing something fanciful, let us say that the “block” is just a single machine word. On current hardware, a machine word span 64 bits. You have wider registers for SIMD operations, but let us keep things simple. I am going to take my values and map them to a 64-bit word, and I will set a number of bits to 1 within this word, and only within this word. Such a word might be called a fingerprint. Thus, at query time, I will only need to check a single word and compare it with the fingerprint (a bitwise AND followed by a compare). It should be really fast and easy to implement.

Probabilistic filters, as their name implies, have a small probability of getting it wrong: they can claim that a value is in the set while it is not. We call this error margin the “false-positive rate”. You might think that you want this error to be as small as possible, but there is a tradeoff. For a false-positive rate of 1/2p, you need at least p bits per entries. Conventional Bloom filters have an overhead of 44% in the best case scenario, so you actually need 1.44 p bits per entries.

A reasonable choice might be to pick a false-positive rate of 1%. To achieve such a low rate, you need at least 10 bits per entry with a conventional Bloom filter.

It turns out that I can achieve roughly the same false-positive rate while using 2 extra bits per entry for a total of 12 bits per entry. Assume that you take your input values and hash them to a random-looking 64-bit outputs. From such 64-bit ouputs, you can select a location and generate a 64-bit word with 5 bits set. To do so, I can just select 5 bits locations in [0,64), using 6 of the input bits per location. I can create the word using a function such as this one…

std::function<uint64_t(uint64_t)> fingerprint5 = [](uint64_t x) { 
       return (uint64_t(1) << (x&63)) 
            | (uint64_t(1) << ((x>>6)&63)) 
            | (uint64_t(1) << ((x>>12)&63)) 
            | (uint64_t(1) << ((x>>18)&63))
            | (uint64_t(1) << ((x>>24)&63));

Though it is in C++, the concept is portable to most languages. You might be concerned that such code could be inefficient. Yet the LLVM clang compiler has no trouble producing code that looks efficient (x64 assembly):

        shr     rcx, 6
        mov     eax, 1
        shl     rax, cl
        bts     rax, rdx
        mov     rcx, rdx
        shr     rcx, 12
        bts     rax, rcx
        mov     rcx, rdx
        shr     rcx, 18
        bts     rax, rcx
        shr     rdx, 24
        bts     rax, rdx

Other compilers could possibly be less efficient. Checking for a match is as simple as the following pseudocode:

    uint64_t print = fingerprint(key);
    if( (buffer[location(key,buffer.size())] & print) == print ) {
       // match

I did not write a benchmark yet, but my expectation is that such a word-based Bloom filter would provide excellent performance.

I have published prototypical code on GitHub. In the future, I will provide some benchmarks. There is also a natural extension of this idea where instead of checking a single word, you pick two or more words.

Important: I do not claim that this is a “new” idea. I merely feel that it is an under-appreciated one.

Science and Technology links (September 26th 2021)

  1. Radiation-therapy can rejuvenate heart cells. (source: Nature)
  2. Within members of the same species, cancer risk increases with body size. Large human beings are more at risk than smaller ones. Across species, the reverse is often true: cancer risks decrease with body size. Elephants are much less at risk of cancer than mice.
    We do not know why it works this way: it is called Peto’s paradox. (Credit: John D Cook)
  3. Large minimum wage increases reduced employment rates among low-skilled individuals by just over 2.5 percentage points.
  4. Researchers claim to have found a way to rejuvenate an important part of the immune system using hormonal signals. It is not yet in vivo (nobody was rejuvenated with this technique).
  5. Apple makes it easy to carry your electronic vaccination information with you.
  6. We have a processor for artificial intelligence containing 2.6 trillion transistors. It is many more transistors than we have neurons in the human brain.
  7. Though most native Americans can be traced back to Asians who came to America 15,000 years ago, we have increasing evidence of human life in America far older, going back 23,000 years ago. What happened to these human beings and how they came to America is a mystery.

New release of the simdjson library: version 1.0

The most popular data format on the web is arguably JSON. It is a simple and convenient format. Most web services allow to send and receive data in JSON.

Unfortunately, parsing JSON can be time and energy consuming. Back in 2019, we released the simdjson library. It broke speed records and it is still one of the most efficient and fast JSON parsing library. It makes few compromises. It provides exact float parsing, exact unicode validation and so forth.

An independent benchmark compares it with other fast C++ libraries and demonstrates that it can use far less energy.

simdjson 2.1 J
RapidJSON 6.8 J
C++ for Modern C++ 41 J

We have recently released version 1.0. It took us two years to get at that point. It is an important step for us.

There are many ways a library can give you access to the JSON data. A convenient approach is the DOM tree (DOM stands for document object model). In a DOM-based approach, the document is parsed entirely and materialized in an in-memory construction. For some applications, it is the right model, but in other instances, it is a wasteful step.

We also have streaming-based approaches. In such approaches, you have an event-based interface where the library calls user-provided function when encountering different elements. Though it can be highly efficient, in part because it sidesteps the need to construct a DOM tree, it is a challenging programming paradigm.

Another approach is a simple serialization-deserialization. You provide a native data structure and you ask the library to either write it out in JSON or to turn the JSON into your data structure. It is often a great model. However, it has limited flexibility.

In simdjson, we are proposing a new approach which we call On Demand. The On Demand approach feels like a DOM approach, but it sidesteps the construction of the DOM tree. It is entirely lazy: it decodes only the parts of the document that you access.

With On Demand, you can write clean code as follows:

#include <iostream>
#include "simdjson.h"
using namespace simdjson;
int main(void) {
    ondemand::parser parser;
    padded_string json = padded_string::load("twitter.json");
    ondemand::document tweets = parser.iterate(json);
    std::cout << uint64_t(tweets["search_metadata"]["count"]) << " results." << std::endl;

In such an example, the library accesses only the content that you require, doing only minimal validation and indexing of the whole document.

With On Demand, if you open a file containing 1000 numbers and you need just one of these numbers, only one number is parsed. If you need to put the numbers into your own data structure, they are materialized there directly, without being first written to a temporary tree. Thus we expect that the simdjson On Demand might often provide superior performance, when you do not need to materialize a DOM tree. The On Demand front-end was primarily developed by John Keiser.

In release 1.0 of the simdjson library, the On Demand frontend is our default though we also support a DOM-based approach.

Release 1.0 adds several key features:

  1. In big data analytics, it is common to serialize large sets of records as multiple JSON documents separated by white spaces. You can now get the benefits of On Demand while parsing almost infinitely long streams of JSON records. At each step, you have access to the current document, but a secondary thread indexes the following block. You can thus access enormous files while using a small amount of memory and achieve record-breaking speeds.
  2. Given an On Demand instance (value, array, object, etc.), you can now convert it to a JSON string using the to_json_string method which returns a string view in the original document for unbeatable speeds.
  3. The On Demand front-end now supports the JSON Pointer specification. You can request a specific value using a JSON Pointer within a large document.

The release 1.0 is robust. We have extended and improved our documentation. We have added much testing.

The simdjson library is the result of the work of many people. I would like to thank Nicolas Boyer for working with me over the summer on finishing this version.

You can find simdjson on GitHub. You can use it by adding two files to your project (simdjson.h and simdjson.cpp), or as a CMake dependency or using many popular package managers.