Science and Technology links (October 16th 2021)

  1. The thymus is an important component of our immune system. As we age, the thymus degenerates and our immune system becomes less fit. Emotional and physical distress, malnutrition, and opportunistic bacterial and viral infections further damage the thymus. New research suggests that practical thymus regeneration could be closer than ever. The thymus is able to repair itself and the right mix of signals could convince it to stay fit over time.
  2. It appears that women use sexual vocalization during penetration to increase their sexual attractiveness to their current partner by means of boosting their partner’s self-esteem.
  3. People associate creativity with effortless insight and undervalue persistence. If you just sit down and work hard, you can generate many new ideas. If you just wait for ideas to come, you will generate far fewer ideas.
  4. 63% of faculty perceive that their university does not care about the content of their publications, only where and how much is published. Such a culture can easily lead to people producing work that passes peer review but fails to contribute meaningfully to science or engineering.
  5. Redheaded women are more sexually active than other women because men are more likely to approach them with sexual intentions.

Calling a dynamically compiled function from Go

Compiled programming languages are typically much faster than interpreted programming languages. Indeed, the compilation step produces “machine code” that is ideally suited for the processor. However, most programming languages today do not allow you to change the code once it has been compiled. This means that if you only find out which function you need after the code has been compiled, you have a problem.

It happens all of the time. For example, you might have a database engine that has to do some expensive processing. The expensive work might be based on a query that is only provided long after the code has been compiled and started. It means that your performance could be limited because the code that runs the query was written without sufficient knowledge of the query.

Let us illustrate the issue and a possible solution in Go (the programming language).

Suppose that your program needs to take the power of some integer. You might write a generic function with a loop, such as the following:

func Power(x int, n int) int {
	product := 1
	for i := 0; i < n; i++ {
		product *= x
	}
	return product
}

However, if the power (n) is a constant, such a function is inefficient. If the constant is known at compile time, you can write an optimized function:

func Square(x int) int {
	return x * x
}

But what if the power is only given to you at runtime?

In Go, we are going to write out the following file (on disk):

package main

func FastFunction(x int) int {
	return x * x
}

The file can be created at runtime, with a runtime variable ‘n’ indicating the power:

	file, _ := os.Create("fast.go")
	file.WriteString("package main\nfunc FastFunction(x int) int {\n  return x")
	for i := 1; i < n; i++ {
		file.WriteString(" * x")
	}
	file.WriteString("\n}\n")
	file.Close()

Importantly, the variable ‘n’ can be any integer, including an integer provided by the user, at runtime.
We then compile the new file and load the new function as a “plugin”:

	exec.Command("go", "build", "-buildmode=plugin", "-o", "fast.so", "fast.go").Output()
	plug, _ := plugin.Open("fast.so")
	fastSquare, _ := plug.Lookup("FastFunction")
	loaded, _ := fastSquare.(func(int) int)

And that is all! The ‘loaded’ variable contains a fast and newly compiled function that I can use in the rest of my code.

Of course, writing the file on disk, compiling the result and loading the compiled function is not free. On my laptop, it takes about half a second. Yet once the function is loaded, it appears to be as fast as the native function (see the Square function above).

I expect that Go cannot “inline” the resulting function inside the rest of your code. But that should not be a concern if your function is non-trivial.

Another downside of the approach I described is that it is fragile: it can fail for many reasons, and it is a security risk. Thus you should only do it if it is absolutely necessary. If you are motivated by performance, the gain should be massive (e.g., 2x the performance).

Furthermore, once a plugin is loaded, it cannot be unloaded. If you were to use this method again and again, you might eventually run out of memory.

My source code is available.

Science and Technology links (October 10th 2021)

  1. Evans and Chu suggest, using data and a theoretical model, that as the number of scientists grows, progress may stagnate. Simply put, in a large field with many researchers, a few papers and a few people are able to acquire a decisive advantage over newcomers. Large fields allow more inequality. One can review their model critically. Would you not expect large fields to fragment into smaller fields? And haven’t we seen much progress in fields that have exploded in size?
  2. Keeping your iron stores low might be important to slow your aging. Sadly, much of what you eat has been supplemented with iron because a small fraction of the population needs iron supplementation. It is widely believed that you cannot have too much iron supplementation but, to my knowledge, the long-term effects of iron supplementation have not been carefully assessed.
  3. If you lose weight while having high levels of insulin, you are more likely to lose lean tissues (e.g., muscle) than fat.
  4. A drug similar to viagra helps mice fight obesity.
  5. Age-related hair loss might be due to stem cells escaping the hair follicle bulge. This new work contradicts the commonly held belief that the stem cells die over time. This work may not relate to male-pattern baldness.
  6. People tend to stress the ability of “formal peer review” to set apart the good work from the less significant work. People greatly overestimate the value of formal peer review. They forget that many of the greatest works of science occurred before formal peer review had even been considered. Cortes and Lawrence have assessed the quality of peer review at a (or “the”) major conference these days (NeurIPS). They found several years ago that when two teams of referees independently assessed the same work, they only agreed on about 50% of their assessments. They have extended their work with a new finding:

    Further, with seven years passing since the experiment we find that for accepted papers, there is no correlation between quality scores and impact of the paper as measured as a function of citation count.

    The lesson is pretty clear. Formal peer review can identify and reject the “obviously bad work”. But it is not a difficult task. In fact, I am quite certain that a fully automated system could quickly identify work that nobody should ever read. However, if you have work that has been visibly well executed, following the state-of-the-art, citing the required prior work, using the right data for the problem, and so forth… then it becomes really difficult to tell whether it is great work, or simply work that meets minimal standards. It is obvious if you know how formal peer review works. You have between two and five researchers, with various levels of familiarity with the domain of the paper, who read it over. They do not redo the experiments. They do not redo the demonstrations. Sometimes you get lucky and find a referee that has deep knowledge of the domain (often because they have worked on a similar problem) and they can be really critical. Even so, in the context of a conference peer review process, there is only so much a highly knowledgeable referee can do, since they cannot freely interact with the authors.

For software performance, can you always trust inlining?

It is easier for an optimizing compiler to spot and eliminate redundant operations if it can operate over a large block of code. Nevertheless, it is still recommended to stick with small functions. Small functions are easier to read and debug. Furthermore, you can often rely on the compiler to smartly inline your small functions inside larger functions. Moreover, if you have a just-in-time compiler (e.g., in C# and Java), the compiler may focus its energy on functions that are called often. Thus small functions are more likely to get optimized.

Nevertheless, you sometimes want to manually inline your code. Even the smartest compilers get it wrong. Furthermore, some optimizing compilers are simply less aggressive than others.

We have been working on a fast float-parsing library in C# called csFastFloat. Though it is already several times faster than the standard library, the primary author (Verret) wanted to boost the performance further by using SIMD instructions. SIMD instructions are fancy instructions that allow you to process multiple words at once, unlike regular instructions. The C# language allows you to use SIMD instructions to speed up your code.

He encountered a case where using SIMD instructions failed to bring the performance gains he was expecting. The easy way out, in the end, was to “manually inline” the functions. I am going to review a simplified instance because I believe it is worth discussing.

I am not going to review it in detail, but the following C# function allows you to quickly determine whether 16 bytes are made of ASCII digits. In ASCII, the character ‘0’ has value 48 whereas the character ‘9’ has value 57. The function builds two “vectors” made of the values 47 and 58 (repeated 16 times) and then it does 16 comparisons at once (with CompareGreaterThan and then CompareLessThan). It concludes with a couple of instructions to check whether all conditions are met to have 16 digits.

unsafe static bool is_made_of_sixteen_digits(byte* chars) {
  Vector128<sbyte> ascii0 = Vector128.Create((sbyte)47);
  Vector128<sbyte> after_ascii9 = Vector128.Create((sbyte)58);
  Vector128<sbyte> raw = Sse41.LoadDquVector128((sbyte*)chars);
  var a = Sse2.CompareGreaterThan(raw, ascii0);
  var b = Sse2.CompareLessThan(raw, after_ascii9);
  var c = Sse2.Subtract(a, b); // this is not optimal   
  return (Sse41.TestZ(c,c));
}

Let us write a function that tries to determine whether a string begins with 16 digits, or 32 digits. It is a made-up function that I use purely for illustration purposes.

// return 2 if 32 digits are found
// return 1 if 16 digits are found
// otherwise return 0
unsafe static int ParseNumberString(byte* p, byte* pend) {
  if ((p + 16 <= pend) && is_made_of_sixteen_digits(p)) {
    if((p + 32 <= pend) && is_made_of_sixteen_digits(p + 16)) {
      return 2;
    }
    return 1;
  }
  return 0;
}

If you just write that out, C# might fail to inline the is_made_of_sixteen_digits function. Thankfully, you can tell C# that you do want it to inline the child function by labelling it with an “AggressiveInlining” attribute.

So far so good, right?

Let us look at the assembly output to see how it gets compiled. Do not worry, I do not expect you to read this gibberish. Just look at the lines that start with an arrow (“->”).

<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: vzeroupper
    L0006: lea eax, [ecx+0x10]
    L0009: cmp eax, edx
    L000b: ja short L0071
    -> L000d: vmovupd xmm0, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)]
    -> L0015: vmovupd xmm1, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)]
    L001d: vlddqu xmm2, [ecx]
    L0021: vpcmpgtb xmm0, xmm2, xmm0
    L0025: vpcmpgtb xmm1, xmm1, xmm2
    L0029: vpsubb xmm0, xmm0, xmm1
    L002d: vptest xmm0, xmm0
    L0032: jne short L0071
    L0034: lea eax, [ecx+0x20]
    L0037: cmp eax, edx
    L0039: ja short L006a
    -> L003b: vmovupd xmm0, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)]
    -> L0043: vmovupd xmm1, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)]
    L004b: vlddqu xmm2, [ecx+0x10]
    L0050: vpcmpgtb xmm0, xmm2, xmm0
    L0054: vpcmpgtb xmm1, xmm1, xmm2
    L0058: vpsubb xmm0, xmm0, xmm1
    L005c: vptest xmm0, xmm0
    L0061: jne short L006a
    L0063: mov eax, 2
    L0068: pop ebp
    L0069: ret
    L006a: mov eax, 1
    L006f: pop ebp
    L0070: ret
    L0071: xor eax, eax
    L0073: pop ebp
    L0074: ret

What C# does is create the “vectors” made of the values 47 and 58 twice each (for a total of four vector loads).

There might be a clever way to get C# to stop being inefficient, but you can also just manually inline. That is, you create one function that includes the body of the other. The result might look as follows:

// return 2 if 32 digits are found
// return 1 if 16 digits are found
// otherwise return 0
unsafe static int ParseNumberStringInline(byte* p, byte* pend) {
  if (p + 16 <= pend) {
    Vector128<sbyte> ascii0 = Vector128.Create((sbyte)47);
    Vector128<sbyte> after_ascii9 = Vector128.Create((sbyte)58);    
    Vector128<sbyte> raw = Sse41.LoadDquVector128((sbyte*)p);
    var a = Sse2.CompareGreaterThan(raw, ascii0);
    var b = Sse2.CompareLessThan(raw, after_ascii9);
    var c = Sse2.Subtract(a, b);
    if((p + 32 <= pend) && Sse41.TestZ(c,c)){
      raw = Sse41.LoadDquVector128((sbyte*)p + 16);
      a = Sse2.CompareGreaterThan(raw, ascii0);
      b = Sse2.CompareLessThan(raw, after_ascii9);
      c = Sse2.Subtract(a, b);
      if(Sse41.TestZ(c,c)) { return 2; }
    }
    return 1;
  }
  return 0;
}

This new code is harder to read and maybe harder to maintain. However, let us look at the compiled output:

<Program>$.<<Main>$>g__ParseNumberStringInline|0_2(Byte*, Byte*)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: vzeroupper
    L0006: lea eax, [ecx+0x10]
    L0009: cmp eax, edx
    L000b: ja short L0061
    L000d: vmovupd xmm0, [<Program>$.<<Main>$>g__ParseNumberStringInline|0_2(Byte*, Byte*)]
    L0015: vmovupd xmm1, [<Program>$.<<Main>$>g__ParseNumberStringInline|0_2(Byte*, Byte*)]
    L001d: vlddqu xmm2, [ecx]
    L0021: vpcmpgtb xmm3, xmm2, xmm0
    L0025: vpcmpgtb xmm2, xmm1, xmm2
    L0029: vpsubb xmm4, xmm3, xmm2
    L002d: lea eax, [ecx+0x20]
    L0030: cmp eax, edx
    L0032: ja short L005a
    L0034: vptest xmm4, xmm4
    L0039: jne short L005a
    L003b: vlddqu xmm2, [ecx+0x10]
    L0040: vpcmpgtb xmm3, xmm2, xmm0
    L0044: vpcmpgtb xmm2, xmm1, xmm2
    L0048: vpsubb xmm4, xmm3, xmm2
    L004c: vptest xmm4, xmm4
    L0051: jne short L005a
    L0053: mov eax, 2
    L0058: pop ebp
    L0059: ret
    L005a: mov eax, 1
    L005f: pop ebp
    L0060: ret
    L0061: xor eax, eax
    L0063: pop ebp
    L0064: ret

I do not expect you to read this gibberish, but notice how the result is now tighter. It is also going to be slightly faster.

As a rule of thumb, if the assembly output of your code is shorter, the code will usually give you better performance. It is simply the case that executing fewer instructions is often faster.

Though my example is a toy example, you should expect the csFastFloat library to benefit from SIMD instructions in the near future. The preliminary numbers I have seen suggest a nice performance bump. There is a pull request.

My source code is available.

Working in virtual reality

Inspired by a post by Paul Tomlinson, I wrote my last blog post entirely in virtual reality (VR). You put on goggles and see a virtual version of your computer screen wherever you would like. Meanwhile, everything around you can be anything you would like. Right now, I am floating in space and I can see the stars all around me.

Why would you ever want to work in this manner? One of the most obvious benefits is immersion. You can set up a working environment that is ideally suited for your work, with all the right screens and windows. You can be free from distractions. There may also be accessibility benefits: it gives you more freedom in how your physical body is positioned.

I had explored the idea of a virtual office before, but the screen resolution of my early HTC Vive made reading computer text just too painful. The hardware is better today.

I must admit that the claim that it could make you more productive is speculative at this point. However, progress depends on us being unreasonable. You need to try out crazy ideas if you hope for progress.

I had my wife take a picture:

I used an old Oculus Quest. It was my second attempt. The second-to-last blog post was written partially in VR, but my Oculus Quest ran out of power before I could complete the experience.

Like Tomlinson, I used Immersed VR as the software. I do not have any financial interest in either Immersed VR or Oculus. Here are my thoughts:

  1. It works. The hardware is really fantastic, as always, even though it is an old Quest (1.0). The software works. I had minor issues, like the application somehow becoming unresponsive, but I could easily be productive all day long.
  2. Maybe because I need to wear glasses, the experience is not as great as it could be: I feel mild pressure on my nose. It is reportedly possible to buy vision-adjusted lenses for the headset. I am going to have to investigate.
  3. I was expecting that reading text in VR would prove difficult, but it is not. Importantly, because you can put as many gigantic windows as you want around you, a less-than-perfect resolution is acceptable.
  4. The main difficulty is proving to be that I cannot see my keyboard. I rely on my peripheral vision to locate my fingers in relation to the keyboard. It is fine once I start typing, as long as I do not need to press uncommon keys. But if I move my hands outside their usual typing position, I lose track of their exact position on the keyboard and I end up frequently hitting the wrong key. As you can see in the picture, I am using a wireless keyboard and a wireless mouse. Working directly on the laptop proved too annoying because its flat keyboard makes it hard for me to locate the keys by touch. At least, with a standalone keyboard, I can feel its exact location. I suspect that changing the keyboard type could help: I tried with a (mechanical) gaming keyboard and it seemed to work better. There is a way to make a virtual version of your keyboard appear in virtual reality but, so far, it has not helped me enough. For now, I just put stickers on the keyboard to guide my fingers, but that only helps so much.
  5. Whenever I take off the Quest and put it back on, my windows have moved. Since the virtual keyboard moves along with them, it becomes much less useful; I ended up not using it. I try hard to avoid taking off the goggles while I work.
  6. Power usage is a concern. The Quest is standalone but it has a small battery and limited autonomy. So I have a small power cable connected to it.
  7. The Quest takes a few seconds to boot up. The application (Immersed VR) also takes a few seconds. You cannot just shut the Quest down and resume from your last work session.
  8. Many of us work in videoconferences. Clearly, nobody wants to watch me wearing a VR headset. There is reportedly a virtual webcam. I have yet to try it.

Science and Technology links (October 3rd 2021)

  1. In a clinical trial, most people were able to cure their diabetes by losing weight.
  2. Video games improve intelligence over many years, while socializing has no effect.
  3. To go to Mars safely, time is of the essence because astronauts would be exposed to radiation and particles from outside the solar system.
  4. It is unlikely that there is an upper bound to human longevity under 130 years. Though men have a drastically reduced longevity compared to women, it appears that once you reach 108 years, there may not be any gender difference.
  5. Recent progress having to do with protein folding can be largely attributed to better hardware and more data, and not really to a better understanding of protein folding. The same could be said for natural language processing. (Speculative)
  6. The children of the survivors of Chernobyl do not have excess mutations. It is consistent with a worldview in which radiation is less harmful to our biology than commonly assumed.
  7. Researchers have built flexible glass that is much stronger and more flexible; it does not shatter easily.

Word-aligned Bloom filters

Programmers often need to ‘filter out’ data. Suppose that you are given a database of users where only a small percentage are ‘paying customers’ (say 5% or less). You can write an SQL query to check whether a given user is indeed a paying customer, but it might require a round trip to your database engine. Instead, you would like to hold a small ‘filter’ in memory to quickly check whether the given user is a paying customer. When you are pretty sure it might be a paying customer, then, and only then, do you query the database.

Probabilistic filters are ideally suited for such tasks. They return “true” when the key might be found, and “false” when the key definitively cannot be found. A probabilistic filter can be fast and concise.

The conventional probabilistic filter is the Bloom filter. We construct a Bloom filter as follows. We start with an array of bits. Initially, all of the bits are set to 0. When a new value is added to the filter, we map it to several “random” locations in the array of bits. All of the bits at the matching locations are set to 1. The random mapping is done using “hash functions”. A hash function is a function that returns “random-looking” values: it is nevertheless a genuine function. That is, a given hash function always returns the same output given the same input.

When querying a Bloom filter, we hash the received value to several locations in the array of bits. If all of the corresponding bits are set to 1, then the value might be in the set. If even just one of these bits is 0, then we know that the value was never added to the set.
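
To make the description concrete, here is a minimal Bloom filter sketch in C++. It is not the code of any particular library: the two multiplicative hash functions are arbitrary choices for illustration.

#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal Bloom filter sketch: an array of bits with two hash functions.
// The multiplicative constants are arbitrary odd numbers picked for the example.
struct bloom_filter {
  std::vector<bool> bits;
  explicit bloom_filter(size_t n) : bits(n) {}
  size_t h1(uint64_t key) const { return (key * 0x9E3779B97F4A7C15ULL) % bits.size(); }
  size_t h2(uint64_t key) const { return (key * 0xC2B2AE3D27D4EB4FULL) % bits.size(); }
  void add(uint64_t key) { bits[h1(key)] = true; bits[h2(key)] = true; }
  // true: the key might be in the set; false: the key was definitely never added
  bool might_contain(uint64_t key) const { return bits[h1(key)] && bits[h2(key)]; }
};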

A Bloom filter can be remarkably efficient. But can you do better? There are many other probabilistic filters, but let us say that you want to remain close to Bloom filters.

One way to improve on the Bloom filter is to avoid mapping your values all over the array of bits. Instead, you might first map the value to a small block of bits, and then check the bits within a limited range. It can make processing much faster because you do not need multiple data loads in an array. This approach is known under the name of “blocked Bloom filter”. It is much less popular than conventional Bloom filters. One reason why it might be less popular is that blocked Bloom filters require more storage.

I suspect that another reason people might prefer Bloom filters to alternatives such as blocked Bloom filters has to do with implementation. Bloom filters are mature and it is easy to pick up a library. E.g., there is a mature Bloom filter library in Go.

What is the simplest blocked Bloom filter you could do? Instead of doing something fanciful, let us say that the “block” is just a single machine word. On current hardware, a machine word spans 64 bits. You have wider registers for SIMD operations, but let us keep things simple. I am going to take my values and map them to a 64-bit word, and I will set a number of bits to 1 within this word, and only within this word. Such a word might be called a fingerprint. Thus, at query time, I only need to check a single word and compare it with the fingerprint (a bitwise AND followed by a comparison). It should be really fast and easy to implement.

Probabilistic filters, as their name implies, have a small probability of getting it wrong: they can claim that a value is in the set while it is not. We call this error margin the “false-positive rate”. You might think that you want this error to be as small as possible, but there is a tradeoff. For a false-positive rate of 1/2^p, you need at least p bits per entry. Conventional Bloom filters have an overhead of 44% in the best-case scenario, so you actually need 1.44 p bits per entry.

A reasonable choice might be to pick a false-positive rate of 1%. To achieve such a low rate, you need at least 10 bits per entry with a conventional Bloom filter.

It turns out that I can achieve roughly the same false-positive rate while using 2 extra bits per entry, for a total of 12 bits per entry. Assume that you take your input values and hash them to random-looking 64-bit outputs. From such a 64-bit output, you can select a location and generate a 64-bit word with 5 bits set. To do so, I can just select 5 bit locations in [0,64), using 6 of the input bits per location. I can create the word using a function such as this one…

std::function<uint64_t(uint64_t)> fingerprint5 = [](uint64_t x) { 
       return (uint64_t(1) << (x&63)) 
            | (uint64_t(1) << ((x>>6)&63)) 
            | (uint64_t(1) << ((x>>12)&63)) 
            | (uint64_t(1) << ((x>>18)&63))
            | (uint64_t(1) << ((x>>24)&63));
    };

Though it is in C++, the concept is portable to most languages. You might be concerned that such code could be inefficient. Yet the LLVM clang compiler has no trouble producing code that looks efficient (x64 assembly):

        shr     rcx, 6
        mov     eax, 1
        shl     rax, cl
        bts     rax, rdx
        mov     rcx, rdx
        shr     rcx, 12
        bts     rax, rcx
        mov     rcx, rdx
        shr     rcx, 18
        bts     rax, rcx
        shr     rdx, 24
        bts     rax, rdx

Other compilers could possibly be less efficient. Checking for a match is as simple as the following pseudocode:

    uint64_t print = fingerprint(key);
    if( (buffer[location(key,buffer.size())] & print) == print ) {
       // match
    }
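
The location function in this pseudocode is not spelled out. One plausible way to write it (my assumption, not a canonical choice) is to take the hash bits not consumed by fingerprint5 and map them fairly onto [0, number of words) with a multiply-shift (the “fastrange” trick):

#include <cstdint>

// Map the upper 32 bits of the hash fairly onto [0, number_of_words).
// fingerprint5 consumes the lower 30 bits of the hash, so there is no overlap.
uint32_t location(uint64_t hash, uint32_t number_of_words) {
  uint32_t x = uint32_t(hash >> 32);
  return uint32_t((uint64_t(x) * uint64_t(number_of_words)) >> 32);
}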

I did not write a benchmark yet, but my expectation is that such a word-based Bloom filter would provide excellent performance.

I have published prototypical code on GitHub. In the future, I will provide some benchmarks. There is also a natural extension of this idea where instead of checking a single word, you pick two or more words.

Important: I do not claim that this is a “new” idea. I merely feel that it is an under-appreciated one.

Science and Technology links (September 26th 2021)

  1. Radiation therapy can rejuvenate heart cells. (source: Nature)
  2. Within members of the same species, cancer risk increases with body size. Large human beings are more at risk than smaller ones. Across species, the reverse is often true: cancer risks decrease with body size. Elephants are much less at risk of cancer than mice.
    We do not know why it works this way: it is called Peto’s paradox. (Credit: John D Cook)
  3. Large minimum wage increases reduced employment rates among low-skilled individuals by just over 2.5 percentage points.
  4. Researchers claim to have found a way to rejuvenate an important part of the immune system using hormonal signals. It is not yet in vivo (nobody was rejuvenated with this technique).
  5. Apple makes it easy to carry your electronic vaccination information with you.
  6. We have a processor for artificial intelligence containing 2.6 trillion transistors. That is many more transistors than there are neurons in the human brain.
  7. Though most Native Americans can be traced back to Asians who came to America 15,000 years ago, we have increasing evidence of far older human life in America, going back 23,000 years. What happened to these human beings and how they came to America is a mystery.

New release of the simdjson library: version 1.0

The most popular data format on the web is arguably JSON. It is a simple and convenient format. Most web services allow you to send and receive data in JSON.

Unfortunately, parsing JSON can be time and energy consuming. Back in 2019, we released the simdjson library. It broke speed records and it is still one of the fastest and most efficient JSON parsing libraries. It makes few compromises: it provides exact float parsing, exact Unicode validation and so forth.

An independent benchmark compares it with other fast C++ libraries and demonstrates that it can use far less energy.

simdjson 2.1 J
RapidJSON 6.8 J
JSON for Modern C++ 41 J

We recently released version 1.0. It took us two years to get to this point. It is an important step for us.

There are many ways a library can give you access to the JSON data. A convenient approach is the DOM tree (DOM stands for document object model). In a DOM-based approach, the document is parsed entirely and materialized in an in-memory construction. For some applications, it is the right model, but in other instances, it is a wasteful step.

We also have streaming-based approaches. In such approaches, you have an event-based interface where the library calls user-provided functions when encountering different elements. Though it can be highly efficient, in part because it sidesteps the need to construct a DOM tree, it is a challenging programming paradigm.

Another approach is a simple serialization-deserialization. You provide a native data structure and you ask the library to either write it out in JSON or to turn the JSON into your data structure. It is often a great model. However, it has limited flexibility.

In simdjson, we are proposing a new approach which we call On Demand. The On Demand approach feels like a DOM approach, but it sidesteps the construction of the DOM tree. It is entirely lazy: it decodes only the parts of the document that you access.

With On Demand, you can write clean code as follows:

#include <iostream>
#include "simdjson.h"
using namespace simdjson;
int main(void) {
    ondemand::parser parser;
    padded_string json = padded_string::load("twitter.json");
    ondemand::document tweets = parser.iterate(json);
    std::cout << uint64_t(tweets["search_metadata"]["count"]) << " results." << std::endl;
}

In such an example, the library accesses only the content that you require, doing only minimal validation and indexing of the whole document.

With On Demand, if you open a file containing 1000 numbers and you need just one of these numbers, only one number is parsed. If you need to put the numbers into your own data structure, they are materialized there directly, without being first written to a temporary tree. Thus we expect that the simdjson On Demand might often provide superior performance, when you do not need to materialize a DOM tree. The On Demand front-end was primarily developed by John Keiser.

In release 1.0 of the simdjson library, the On Demand front-end is our default, though we also support a DOM-based approach.

Release 1.0 adds several key features:

  1. In big data analytics, it is common to serialize large sets of records as multiple JSON documents separated by whitespace. You can now get the benefits of On Demand while parsing almost infinitely long streams of JSON records. At each step, you have access to the current document, while a secondary thread indexes the following block. You can thus access enormous files while using a small amount of memory and achieve record-breaking speeds.
  2. Given an On Demand instance (value, array, object, etc.), you can now convert it to a JSON string using the to_json_string method which returns a string view in the original document for unbeatable speeds.
  3. The On Demand front-end now supports the JSON Pointer specification. You can request a specific value using a JSON Pointer within a large document.
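
Here is a rough sketch of how the JSON Pointer and to_json_string features might be combined on the twitter.json document from above; treat it as an illustration rather than reference code.

#include <iostream>
#include <string_view>
#include "simdjson.h"
using namespace simdjson;
int main(void) {
    ondemand::parser parser;
    padded_string json = padded_string::load("twitter.json");
    ondemand::document doc = parser.iterate(json);
    // JSON Pointer: jump straight to a nested value.
    ondemand::value count = doc.at_pointer("/search_metadata/count");
    // to_json_string: recover the raw JSON text of that value as a string view.
    std::string_view raw = to_json_string(count);
    std::cout << raw << std::endl;
}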

Release 1.0 is robust. We have extended and improved our documentation. We have added much testing.

The simdjson library is the result of the work of many people. I would like to thank Nicolas Boyer for working with me over the summer on finishing this version.

You can find simdjson on GitHub. You can use it by adding two files to your project (simdjson.h and simdjson.cpp), or as a CMake dependency or using many popular package managers.

Science and Technology links (September 18th 2021)

    1. 4.5% of us are psychopaths.
    2. U.S. per capita CO2 emissions are lower than they were in 1918.
    3. Nine out of ten people with Alzheimer’s lose some of their sense of smell.
    4. Graphene-based hard drives could have ten times the storage capacity.
    5. Ageing yields improvements as well as declines across attention and executive functions.
    6. Progress toward an Alzheimer’s vaccine.
    7. The world’s first recorded case of an autonomous drone attacking humans took place in March 2020.
    8. Abstinence from alcohol is associated with an increased risk for all-cause dementia.

Random identifiers are poorly compressible

It is common in data engineering to find that we have too much data. Thus engineers commonly seek compression routines.

At the same time, random identifiers are handy. Maybe you have many users or transactions and you want to assign each one of them a unique identifier. It is not uncommon for people to use wide randomized identifiers, e.g. 64-bit integers.

By definition, if your identifiers are random, they are hard to compress. Compression fundamentally works by finding and eliminating redundancy.

Thus, if you want your identifiers to be compressible, you should make sure that they are not random. For example, you can use local identifiers that are sequential or nearly sequential (1, 2,…). Or you may try to use as few bits as possible per identifier: if you have only a couple of billions of identifiers, then you may want to limit yourself to 32-bit identifiers.

Often people do not want to design their systems by limiting the identifiers. They start with wide random identifiers (64-bit, 128-bit or worse) and they seek to engineer compressibility back into the system.

If you generate distinct identifiers, some limited compression is possible. Indeed, if the same integer cannot occur twice, then knowing one of them gives you some information about the others. In the extreme case where you have many, many distinct identifiers, they become highly compressible. For example, if I tell you that I have 2^64 distinct 64-bit integers, then you know exactly what they are (all of them). But you are unlikely to ever have 2^64 distinct elements in your computer.

If you have relatively few distinct 64-bit integers, how large is the possible compression?

We can figure it out with a simple information-theoretic analysis. Let n be the number of possible identifiers (say 2^64), and let k be the number of distinct identifiers in your system. There are “n choose k” (the binomial coefficient) different possibilities. You can estimate “n choose k” with n^k/k! when k is small compared to n and n is large. If I want to figure out how many bits are required to index a value among n^k/k! possibilities, I need to compute the logarithm. I get k log n – log k!, where the log is in base 2. Without compression, we can trivially use log n bits per entry. We see that with compression, I can save about (1/k) log k! bits per entry. By Stirling’s approximation, that is no more than log k. So if you have k unique identifiers, you can save about log k bits per identifier with compression.

My analysis is rough, but it should be in the right ballpark. If you have 10,000 unique 64-bit identifiers then, at best, you should require about 52 bits per identifier instead of 64… so about a 20% saving. It may not be practically useful. With a million unique 64-bit identifiers, things are a bit better and you can reach a saving of about 30%. You probably can get close to this ideal compression ratio with relatively simple techniques (e.g., binary packing).
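
If you want to double-check these figures, the bound from the analysis above is easy to compute. Here is a short C++ sketch (the function name is mine; it uses lgamma to evaluate log2 of k! without overflow):

#include <cmath>
#include <cstdio>

// Rough information-theoretic bound: bits needed per identifier when storing
// k distinct values drawn from 2^64 possibilities, i.e., 64 - (1/k) log2(k!).
double bits_per_identifier(double k) {
  double log2_k_factorial = std::lgamma(k + 1.0) / std::log(2.0); // log2(k!)
  return 64.0 - log2_k_factorial / k;
}

int main() {
  std::printf("k = 10000   : %.1f bits per identifier\n", bits_per_identifier(10000.0));
  std::printf("k = 1000000 : %.1f bits per identifier\n", bits_per_identifier(1000000.0));
  return 0;
}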

To get some compression, you need to group many of your identifiers. That is not always convenient. Furthermore, the compression itself may trade some performance away for compression.

How I debate

Many of us feel that the current intellectual climate is difficult to bear. When I first noticed the phenomenon, people told me that it was because of Donald Trump: he just made it impossible to debate calmly. But now that Trump is gone, the climate is just as bad, if not worse.

Debates are essential in a free society. The alternative to debate is force. Either you convince your neighbour to do as you think they should do, or else you send men with guns to his place.

It is tempting, when you have the upper hand, to use force and aggressive tactics against your opponents. However, this leaves little choice to your opponents: they have to return the favour. And if you both live long enough, chances are that they will.

Civility is a public good. We have to all commit to it.

I do not pretend to be the perfect debater. I make mistakes all the time. However, I try to follow these rules.

  1. Do not hope to change people’s core stance. This rarely, if ever, happens. That is not why we debate. If someone is in favour of Brexit and you are not, you can argue until you are blue in the face and they won’t change their stance. One of the core reasons to debate is to find common ground. People will naturally shy away from arguments that are weak. You can see a debate as a friendly battleground. Once the battle is over, you have probably not taken the other person’s moat, but if you did your job, you have enticed them to drop bad arguments. And they have done the same: they have exposed weaknesses in your models. It implies that the debate should bear on useful elements, like arguments and facts. It also implies that debate can and should be productive… even if it never changes anyone’s stance.
  2. Your goal in a debate is neither to demonstrate that the other person is bad nor that you are good. Leave people’s character out of the debate. This includes your own character. For example, never argue that you are a good person. Reject character assassination, either of yourself or of others. The most popular character assassination tactic is “by association”: “your employer once gave money to Trump so you are a racist”. “You read this news source so you are part of their cult.” You must reject such arguments, whether they are applied to you or to others. Another popular tactic is to question people’s motives. Maybe someone works for a big oil company, so that explains why they are in favour of Brexit. Maybe someone is a member of the communist party, and that’s why they want to give the government more power. It is true that people’s motives impact their opinions, but it has no place in civil debate. You can privately think that a given actor is “sold out”, but you should not say it.
  3. Shy away from authority-based arguments. Saying that such and such is true because such and such individual says so, is counterproductive because the other side can do the same and the debate will be sterile. You can and should provide references and sources, but for the facts and arguments that they carry, not for their authority.

I believe that a case can be made that without a good intellectual climate, liberalism is bound to fade away. If you want to live in a free society, you have to help enforce good debates. If you are witnessing bad debates, speak up. Remind people of the rules. In fact, if I deviate from these rules, remind me: I will thank you.

Further reading: Arne Næss.

The big-load anti-pattern

When doing data engineering, it is common for engineers to want to first load all of the data in memory before processing it. If you have sufficient memory, and the loaded data is not ephemeral or the volumes are small, it is a sensible approach. After all, that is how a spreadsheet typically works: you load the whole spreadsheet in memory.

But what if your application falls in the “extract-transform-load” category: as you scan the data, you discard it? What if you have large volumes of data? Then I consider the “load everything in memory at once” a performance anti-pattern when you are doing high performance data engineering.

The most obvious problem with a big load is scalability. As your data inputs get larger and larger, you consume more and more memory. Though it is true that over time we get more memory, we also tend to get more processing cores, and RAM is a shared resource. Currently, in 2021, some of the most popular instance types on the popular cloud system AWS have 4 GB per virtual CPU. If you have the means, AWS will provide you with memory-optimized virtual nodes that have 24 TB of RAM. However, these nodes have 448 logical processors sharing that memory.

Frequent and large memory allocations are somewhat risky. A single memory allocation failure may force an entire process to terminate. On a server, it might be difficult to anticipate what other processes do and how much memory they use. Thus each process should seek to keep its memory usage predictable, if not stable. Simply put, it is nicer to build your systems so that, as much as possible, they use a constant amount of memory irrespective of the input size. If you are designing a web service, and you put a hard limit on the size of a single result, you will help engineers build better clients.

You may also encounter various limits which reduce your portability. Not every cloud framework will allow you to upload a 40 GB file at once, without fragmentation. And, of course, on-device processing in a mobile setting becomes untenable if you have no bound on the data inputs.

But what about the performance? If you have inefficient code (maybe written in JavaScript or bad C++), then you should have no worries. If you are using a server that is not powerful, then you will typically have little RAM and a big load is a practical problem irrespective of the performance: you may just run out of memory. But if you are concerned with performance and you have lots of resources, the story gets more intricate.

If you are processing the data in tiny increments, you can keep most of the data that you are consuming in CPU cache. However, if you are using a big-load, then you need to allocate a large memory region, initialize it, fill it up and then read it again. The data goes from the CPU to the RAM and back again.

The process is relatively expensive. To illustrate the point, I wrote a little benchmark. I consider a function which allocates memory and populates an array of integers with the values 0, 1, 2…

  int * content = new int[volume/sizeof(int)];
  init(content, volume);
  delete[] content;

It is a silly function: almost anything else you might do that involves memory allocation will be slower. So how fast is this function? I get the following numbers on two of my machines. I pick the best results within each run.

1 MB 1 GB
alloc-init (best) – AMD Rome Linux 33 GB/s 7 GB/s
alloc-init (best) – Apple M1 30 GB/s 9 GB/s

Simply put, allocating memory and pushing data into it gets slower and slower with the volume. We can explain it in terms of CPU cache and RAM, but the principle is entirely general.

You may consider 7 GB/s or 9 GB/s to be a good speed, and indeed these processors and operating systems are efficient. However, consider that it is actually your starting point. We haven’t read the data yet. If you need to actually “read” that data, let alone transform it or do any kind of reasoning over it, you must then bring it back from RAM to cache. So you have the full cache-to-RAM and RAM-to-cache cycle. In practice, it is typically worse: you load the whole huge file into memory. Then you allocate memory for an in-memory representation of the content. You then rescan the file and put it in your data structure, and then you scan your data structure again. Unavoidably, your speed will start to fall… 5 GB/s, 2 GB/s… and soon you will be in the megabytes per second.

Your pipeline could be critically bounded because it is built out of slow software (e.g., JavaScript code) or because you are relying on slow networks and disk. To be fair, if the rest of your pipeline runs in the megabytes per second, then memory allocation might as well be free from a speed point of view. That is why I qualify the big-load to be an anti-pattern for high-performance data engineering.

In a high-performance context, for efficiency, you should stream through the data as much as possible, reading it in chunks that are roughly the size of your CPU cache (e.g., megabytes). The best chunk size depends on many parameters, but it is typically not tiny (kilobytes) nor large (gigabytes). If you bypass such an optimization as part of your system’s architecture, you may have hard limits on your performance later.
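
To make this concrete, here is a sketch of what chunked processing might look like in C++; the 1 MB chunk size and the process_chunk callback are placeholders that you would adapt to your workload.

#include <cstddef>
#include <fstream>
#include <functional>
#include <vector>

// Stream a file in cache-friendly chunks instead of loading it whole:
// memory usage stays constant regardless of the input size.
void stream_file(const char *filename,
                 const std::function<void(const char *, size_t)> &process_chunk) {
  constexpr size_t chunk_size = 1 << 20; // about 1 MB per chunk (placeholder value)
  std::vector<char> buffer(chunk_size);
  std::ifstream in(filename, std::ios::binary);
  while (in) {
    in.read(buffer.data(), chunk_size);
    size_t count = size_t(in.gcount());
    if (count == 0) { break; }
    process_chunk(buffer.data(), count); // consume, then reuse the same buffer
  }
}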

It is best to view the processor as a dot at the middle of a sequence of concentric circles. The processor is hard of hearing: it can only communicate with people in the inner circle. But there is limited room in each circle. The further you are from the processor, the more expensive it is for the processor to talk to you because you first need to move to the center, possibly pushing out some other folks. The room close to the processor is crowded and precious. So if you can, you should have your guests come into the center once, and then exit forever. What a big load tends to do is to get people into the inner circle, then out to some remote circle, and then back again into the inner circle. It works well when there are few guests because everyone gets to stay in the inner circle or nearby, but as more and more people come in, it becomes less and less efficient.

It does not matter how your code looks: if you need to fully deserialize all of a large data file before you process it, you have a case of big load. Whether or not you are using fancy techniques such as memory-mapped files does not change the equation. Some parameters, like the size of your pages, may help, but they do not affect the core principles.

Adding more memory to your system is likely to make the problem relatively worse. Indeed, systems with lots of memory can often pre-empt or buffer input/output accesses. It means that their best achievable throughput is higher, and thus the big-load penalty relatively worse.

How may you avoid the big-load anti-pattern?

    • Within the files themselves, you should have some kind of structure so that you do not need to consume the whole file at once when it is large. This comes naturally with popular formats such as CSV, where you can often consume just one line at a time. If you are working with JSON data files, you may want to adopt JSON streaming for an equivalent result. Most data-engineering formats will support some concept of chunk or page to help you.
    • Consider splitting your data. If you have a database engine, you may consider sharding. If you are working with large files, you may want to use smaller files. You should be cautious not to fall for the small-load anti-pattern. E.g., do not store only a few bytes per file and do not fragment your web applications into 400 loadable resources.
    • When compressing data, try to make sure you can uncompress small usable chunks (a few megabytes).

 

Note: If you are an experienced data engineer, you might object that everything I wrote is obvious. I would agree. This post is not meant to be controversial.

How fast can you pipe a large file to a C++ program?

Under many operating systems, you can send data from one process to another using ‘pipes’. The term ‘pipe’ is probably used by analogy with plumbing and we often use the symbol ‘|’ to represent a pipe (it looks like a vertical pipe).

Thus, for example, you can sort a file and send it as input to another process:

sort file | dosomething

The operating system takes care of the data. It can be more convenient than sending the data to a file first. You can have a long sequence of pipes, processing the data in many steps.

How efficient is it computationally?

The speed of a pipe depends on the program providing the data. So let us build a program that just outputs a lot of spaces very quickly:

  constexpr size_t buflength = 16384;
  std::vector<char> buffer(buflength, ' ');
  for(size_t i = 0; i < repeat; i++) {
    std::cout.write(buffer.data(), buflength);
  }

For the receiving program, let us write a simple program that receives the data, little else:

  constexpr size_t cache_length = 16384;
  char cachebuffer[cache_length];
  size_t howmany = 0;
  while(std::cin) {
    std::cin.read(cachebuffer, cache_length);
    howmany += std::cin.gcount();
  }

You could play with the buffer sizes: I use relatively large buffers to minimize the pipe overhead.

I am sure you could write more efficient programs, but I believe that most software using pipes is going to be less efficient than these two programs.

I get speeds that are quite good under Linux but rather depressing under macOS:

macOS (Big Sur, Apple M1) 0.04 GB/s
Linux (CentOS 7, AMD Rome) 2 GB/s to 6.5 GB/s

Your results will be different: please run my benchmark. It might be possible to go faster with larger inputs and larger buffers.

Even if the results are good under Linux, the bandwidth is not infinite. You will get better results passing data from within a program, even if you need to copy it.

As observed by one of the readers of this blog, you can fix the performance problem under macOS by falling back on a C API:

  size_t howmany = 0;
  size_t tr;
  while((tr = read(0, cachebuffer, cache_length))) {
    howmany += tr;
  }

You lose portability, but you gain a lot of performance. I achieve a peak performance of 7 GB/s or above which is much more comparable to the cost of copying the data within a process.

It is not uncommon for standard C++ approaches to disappoint performance-wise.

Science and Technology links (July 31st 2021)

  1. Researchers built a microscope that might be 10 times better than the best available microscopes.
  2. Subsidizing college education can lower earnings due to lower job experience:

    The Post 9/11 GI Bill (PGIB) is among the largest and most generous college subsidies enacted thus far in the U.S. (…) the introduction of the PGIB raised college enrollment by 0.17 years and B.A. completion by 1.2 percentage points. But, the PGIB reduced average annual earnings nine years after separation from the Army.

  3. Better-looking academics have more successful careers. If the current pandemic reduces in-person meetings, this effect might become weaker.
  4. It appears that women frequently rape men:

    (…) male rape happens about as often as female rape, and possibly exceeds it. Evidence also shows that 80% of those who rape men are women.

  5. In the past, Greenland experienced several sudden warming episodes by as much as 16 degrees, without obvious explanation.
  6. Researchers took stem cells, turned them into ovarian follicles, and ended up with viable mouse offspring. Maybe such amazing technology could come to a fertility clinic near you one day.
  7. We may soon benefit from a breakthrough that allows us to grow rice and potatoes with 50% more yield.
  8. American trucks sold today are often longer than military tanks used in the second world war.
  9. Organic food may not be better:

    If England and Wales switched 100 per cent to organic it would actually increase the greenhouse gas emissions associated with our food supply because of the greater need for imports. Scaling up organic agriculture might also put at risk the movement’s core values in terms of promoting local, fresh produce and small family farms.

  10. You can transmit data at over 300 terabits per second over the Internet.

    Not only have the researchers in Japan blown the 2020 record out of the proverbial water, but they’ve done so with a novel engineering method capable of integrating into modern-day fiber optic infrastructure with minimal effort.

    It suggests that we are far away from upper limits in our everyday Internet use and that there are still fantastic practical breakthroughs to come. What could you do with a nearly infinite data bandwidth?

  11. We are using robots to sculpt marble.
  12. Nuclear fusion might bring unlimited energy supplies. It seems that we might be close to a practical breakthrough.
  13. We still do not know why human females have permanent large breasts.
  14. It is unclear whether influenza vaccines are effective.
  15. Some ants effectively never age, because of a parasite living in them.

Measuring memory usage: virtual versus real memory

Software developers are often concerned with the memory usage of their applications, and rightly so. Software that uses too much memory can fail, or be slow.

Memory allocation will not work the same way under all systems. However, at a high level, most modern operating systems have virtual memory and physical memory (RAM). When you write software, you read and write memory at addresses. On modern systems, these addresses are 64-bit integers. For all practical purposes, you have an infinite number of these addresses: each running program could access hundreds of terabytes.

However, this memory is virtual. It is easy to forget what virtual means: it means that we simulate something that is not really there. So if you are programming in C or C++ and you allocate 100 MB, you may not use 100 MB of real memory. The following line of code may not cost any real memory at all:

  constexpr size_t N = 100000000;
  char *buffer = new char[N]; // allocate 100MB

Of course, if you write or read memory at these ‘virtual’ memory addresses, some real memory will come into play. You may think that if you allocate an object that spans 32 bytes, your application might receive 32 bytes of real memory. But operating systems do not work at such a fine granularity. Rather, they allocate memory in units of “pages”. How big a page is depends on your operating system and on the configuration of your running process. On PCs, a page is often as small as 4 kB, but pages are often larger on ARM systems. Operating systems also allow you to request large pages (e.g., one gigabyte). Your application receives “real” memory in units of pages. You can never just get “32 bytes” of memory from the operating system.

It means that there is no sense in micro-optimizing the memory usage of your application: you should think in terms of pages. Furthermore, receiving pages of memory is a relatively expensive process. So you probably do not want to constantly grab and release memory if efficiency is important to you.

Once we have allocated virtual memory, can we predict the actual (real) memory usage within the following loop?

  for (size_t i = 0; i < N; i++) {
    buffer[i] = 1;
  }

The result will depend on your system. But a simple model is as follows: count the number of consecutive pages you have accessed, assuming that your pointer begins at the start of a page. The memory used by the pages is a lower-bound on the memory usage of your process, assuming that the system does not use other tricks (like memory compression or other heuristics).
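
In code, the model amounts to rounding the number of touched bytes up to a multiple of the page size. Here is a minimal sketch (the page size is queried from the operating system; the function name is mine):

#include <unistd.h>
#include <cstddef>
#include <cstdio>

// Model: real memory is granted one page at a time, so the memory used after
// touching the first 'accessed_bytes' bytes is that count rounded up to a
// multiple of the page size.
size_t predicted_memory(size_t accessed_bytes) {
  size_t page_size = size_t(sysconf(_SC_PAGESIZE)); // typically 4096 on a PC
  size_t pages = (accessed_bytes + page_size - 1) / page_size;
  return pages * page_size;
}

int main() {
  std::printf("%zu bytes\n", predicted_memory(100000)); // 25 pages of 4 kB = 102400
  return 0;
}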

I wrote a little C++ program under Linux which prints out the memory usage at regular intervals within the loop. I use about 100 samples. As you can see in the following figure, my model (indicated by the green line) is an excellent predictor of the actual memory usage of the process.

Thus a reasonable way to think about your memory usage is to count the pages that you access. The larger the pages, the higher the cost in this model. It may thus seem that if you want to be frugal with memory usage, you should use smaller pages. Yet a mobile operating system like Apple’s iOS has larger pages (16 kB) than most PCs (4 kB). Given the choice, I would almost always opt for bigger pages because they make memory allocation and access cheaper. Furthermore, you should probably not worry too much about virtual memory: do not blindly count the address ranges that your application has requested. They might have little to no relation with your actual memory usage.

Modern systems have a lot of memory and very clever memory allocation techniques. It is wise to be concerned with the overall memory usage of your application, but you are more likely to fix your memory issues at the software architecture level than by micro-optimizing the problem.

Faster sorted array unions by reducing branches

When designing an index, a database or a search engine, you frequently need to compute the union of two sorted sets. When I am not using fancy low-level instructions, I have most commonly computed the union of two sorted sets using the following approach:

    v1 = first value in input 1
    v2 = first value in input 2
    while(....) {
        if(v1 < v2) {
            output v1
            advance v1
        } else if (v1 > v2) {
            output v2
            advance v2
        } else {
           output v1 == v2
           advance v1 and v2
        }
    }

I wrote this code while trying to minimize the load instructions: each input value is loaded exactly once (it is optimal). It is not that load instructions themselves are expensive, but they introduce some latency. It is not clear whether having fewer loads should help, but there is a chance that having more loads could harm the speed if they cannot be scheduled optimally.

One defect with this algorithm is that it requires many branches. Each mispredicted branch comes with a severe penalty on modern superscalar processors with deep pipelines. By the nature of the problem, it is difficult to avoid the mispredictions since the data might be random.

Branches are not necessarily bad. When we try to load data at an unknown address, speculating might be the right strategy: when we get it right, we have our data without any latency! Suppose that I am merging values from [0,1000] with values from [2000,3000], then the branches are perfectly predictable and they will serve us well. But too many mispredictions and we might be on the losing end. You will get a lot of mispredictions if you are trying this algorithm with random data.

Inspired by Andrey Pechkurov, I decided to revisit the problem. Can we use fewer branches?

Mispredicted branches in the above routine will tend to occur when we conditionally jump to a new address in the program. We can try to entice the compiler to favour ‘conditional move’ instructions. Such instructions change the value of a register based on some condition; they avoid the jump and thus reduce the penalties due to mispredictions. Given sorted arrays with no duplicated elements, we consider the following code:

while ((pos1 < size1) & (pos2 < size2)) { // '&' rather than '&&' avoids a short-circuit branch
    v1 = input1[pos1];
    v2 = input2[pos2];
    output_buffer[pos++] = (v1 <= v2) ? v1 : v2; // write the smaller of the two values
    pos1 = (v1 <= v2) ? pos1 + 1 : pos1; // advance whichever input supplied that value...
    pos2 = (v1 >= v2) ? pos2 + 1 : pos2; // ...and advance both on a tie
}

You can verify, by inspecting the assembly output, that compilers are good at using conditional-move instructions for this sort of code. In particular, LLVM (clang) does what I would expect. There are still branches, but they are only related to the ‘while’ loop and they are not going to cause a significant number of mispredictions.

Of course, the processor still needs to load the right data. The address only becomes available in its definitive form just as we need to load the value, yet the load itself requires several cycles to complete. It is likely to be a bottleneck, even more so in the absence of branches that can be speculated.

Our second algorithm has fewer branches, but it has more loads: twice as many loads, in fact. Modern processors can sustain more than one load per cycle, so the extra loads should not be a bottleneck if they can be scheduled well.

Testing this code in the abstract is a bit tricky. Ideally, you would want code that stresses all code paths. In practice, if you just use random data, the intersection between the sets will often be small, so the branches are more predictable than they could be. Still, it is maybe good enough for a first benchmarking attempt.
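
For what it is worth, here is a simple way to produce such test data (sorted arrays of distinct pseudo-random 32-bit values); it is a sketch, not my exact benchmark setup.

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Produce a sorted array of (nearly) 'count' distinct pseudo-random 32-bit values.
// Duplicates are rare over the full 32-bit range; they are removed, so the
// result may be slightly shorter than 'count'.
std::vector<uint32_t> random_sorted_array(size_t count, uint32_t seed) {
  std::mt19937 gen(seed);
  std::uniform_int_distribution<uint32_t> dist; // full 32-bit range by default
  std::vector<uint32_t> data(count);
  for (auto &v : data) { v = dist(gen); }
  std::sort(data.begin(), data.end());
  data.erase(std::unique(data.begin(), data.end()), data.end());
  return data;
}

Merging two such arrays (say, a million elements each) generated from different seeds gives plenty of hard-to-predict comparisons for the conventional routine.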

I wrote a benchmark and ran it on a recent Apple processor as well as on an AMD Rome (Zen 2) Linux box. I report the average number of nanoseconds per produced element, so smaller values are better. With LLVM, there is a sizeable benefit (over 10%) on both the Apple (ARM) processor and the Zen 2 processor. However, GCC fails to produce efficient code in the branchless mode. Thus if you plan to use the branchless version, you should definitely try compiling with LLVM. If you are a clever programmer, you might find a way to get GCC to produce code like LLVM does: if you do, please share.

system               conventional union (ns/element)   ‘branchless’ union (ns/element)
Apple M1, LLVM 12    2.5                                2.0
AMD Zen 2, GCC 10    3.4                                3.7
AMD Zen 2, LLVM 11   3.4                                3.0

I expect that this code retires relatively few instructions per cycle. It means that you can probably add extra functionality for free, such as bounds checking, because you have cycles to spare. You should be careful not to introduce extra work that gets in the way of the critical path, however.

As usual, your results will vary depending on your compiler and processor. Importantly, I do not claim that the branchless version will always be faster, or even that it is preferable in the real world. For real-world usage, we would like to test on actual data. My C++ code is available: you can check how it works out on your system. You should be able to modify my code to run on your data.

You should expect such a branchless approach to work well when you had lots of mispredicted branches to begin with. If your data is so regular that the union is effectively trivial, or nearly so, then a conventional approach (with branches) will work better. In my benchmark, I merge ‘random’ data, hence the good results for the branchless approach under the LLVM compiler.

Further reading: For high speed, one would like to use SIMD instructions. If it is interesting to you, please see section 4.3 (Vectorized Unions Between Arrays) in Roaring Bitmaps: Implementation of an Optimized Software Library, Software: Practice and Experience 48 (4), 2018. Unfortunately, SIMD instructions are not always readily available.

Science and Technology links (July 10th 2021)

  1. We are using CRISPR, a state-of-the-art gene editing technique, to edit the genes of live human patients in clinical trials.
  2. A clinical trial has begun regarding an HIV vaccine.
  3. If you choose to forgo meat to fight climate change, you may lower your individual lifetime warming contribution by 2% to 4%.
  4. Age-related testosterone loss may strongly contribute to brain dysfunction in older men.
  5. Israel has used autonomous drone swarms to hunt down its adversaries.
  6. Cleaner air has improved agricultural outputs.
  7. Drinking alcohol could extend or reduce your life depending on how you go about it. Drinking to excess could be harmful but moderate alcohol consumption with meals could be beneficial.
  8. Injectable gene editing compounds have extended the lifespan of mice by more than 30% while improving physical performance.
  9. A gene therapy may get your heart to regenerate after a heart attack.
  10. Ocean acidification does not appear to harm coral reef fishes:

    Together, our findings indicate that the reported effects of ocean acidification on the behaviour of coral reef fishes are not reproducible, suggesting that behavioural perturbations will not be a major consequence for coral reef fishes in high CO2 oceans. (Source: Nature)

  11. It appears that when an embryo is formed, it briefly undergoes a rejuvenation effect. This would explain how older parents can give birth to a young child.
  12. We have long believed that Alzheimer’s was caused by the accumulation of misfolded proteins in the brain. It appears that it could instead be due to the disappearance of soluble proteins in the brain. If so, the cure for Alzheimer’s would entail restoring a normal level of soluble proteins instead of removing misfolded proteins.
  13. A new weight-loss drug was approved to fight against obesity.
  14. Unity Biotechnology announced some positive results in a clinical trial using senolytics (drugs that kill old cells) to treat macular degeneration (a condition that makes older people blind).
  15. Lobsters are believed to experience negligible senescence, meaning that they do not age the way we do. They also appear to avoid cancer.

Compressing JSON: gzip vs zstd

JSON is the de facto standard for exchanging data on the Internet. It is a relatively simple text format inspired by JavaScript. I say “relatively simple” because you can read and understand the entire JSON specification in minutes.

Though JSON is a concise format, it is best sent over a slow network in compressed form. Without any effort, you can often compress JSON files by a factor of ten or more.

Compressing files adds an overhead: it takes time to compress the file, and it takes time again to uncompress it. However, it may be many times faster to send over the network a file that is many times smaller. The benefits of compression go down as the network bandwidth increases. Given the large gains in bandwidth we have experienced in the last decade, compression is maybe less important today: the bandwidth between nodes in a cloud setting (e.g., AWS) can be gigabytes per second. This makes fast decompression all the more important.
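
As a rough illustration, using the file sizes and decompression speed reported below and hypothetical link speeds: over a 10 MB/s link, sending the 617 KB file uncompressed takes about 60 ms, while sending the 48 KB zstd-compressed version takes about 5 ms, and decompressing it at 360 MB/s adds under 2 ms, a clear win. Over a 10 GB/s link, the uncompressed transfer takes well under a millisecond, so the decompression time alone can exceed the savings.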

There are many compression formats. The conventional approach, supported by many web servers, is gzip. There are also more recent and faster alternatives. I pick one popular choice: zstd.

For my tests, I choose a JSON file that is representative of real-world JSON: twitter.json. It is an output from the Twitter API.

Generally, you should expect zstd to compress slightly better than gzip. My results are as follows, using standard Linux command-line tools with default settings:

uncompressed 617 KB
gzip (default) 51 KB
zstd (default) 48 KB

To test the decompression performance, I repeatedly uncompress the same file. Because it is a relatively small file, we should expect disk accesses to be buffered and fast.

Without any tweaking, I get twice the performance with zstd compared to the standard command-line gzip (which may differ from what your web server uses), while also getting better compression. It is a win-win. Modern compression algorithms like zstd can be really fast. For a fairer comparison, I have also included Eric Biggers’ libdeflate utility. It comes out ahead of zstd, which stresses once more the importance of using good software!

gzip (standard)                   175 MB/s
gzip (Eric Biggers’ libdeflate)   424 MB/s
zstd                              360 MB/s

My script is available. I run it under an Ubuntu system. If I create a RAM disk, the numbers go up slightly.

I expect that I understate the benefits of fast compression routines:

    1. I use a docker container. If you use containers, then disk and network accesses are slightly slower.
    2. I use the standard command-line tools. With a tight integration of the software libraries within your software, you can probably avoid many system calls and bypass the disk entirely (see the sketch below).

Thus my numbers are somewhat pessimistic. In practice, you are even more bound by the computational overhead and by the choice of algorithm.
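
To give an idea of what such a tighter integration might look like, here is a minimal in-memory sketch using the zstd library (libzstd, link with -lzstd). The buffer content, repetition count and timing code are arbitrary choices of mine, not part of my benchmark script.

// A minimal in-memory round trip with libzstd: no files or system calls in the timed loop.
#include <zstd.h>

#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

int main() {
  // In a real test, 'input' would hold your JSON document (e.g., twitter.json read once from disk);
  // this placeholder content is only meant to make the sketch self-contained.
  std::string input(617 * 1024, 'x');

  // Compress into a buffer sized with ZSTD_compressBound.
  std::vector<char> compressed(ZSTD_compressBound(input.size()));
  size_t csize = ZSTD_compress(compressed.data(), compressed.size(),
                               input.data(), input.size(), /*level*/ 3);
  if (ZSTD_isError(csize)) { return 1; }

  // Time repeated in-memory decompression.
  std::vector<char> output(input.size());
  auto start = std::chrono::steady_clock::now();
  const int repetitions = 1000;
  for (int i = 0; i < repetitions; i++) {
    size_t dsize = ZSTD_decompress(output.data(), output.size(),
                                   compressed.data(), csize);
    if (ZSTD_isError(dsize)) { return 1; }
  }
  auto end = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(end - start).count();
  double gbps = double(input.size()) * repetitions / seconds / 1e9;
  printf("decompression: %.3f GB/s (of uncompressed output)\n", gbps);
  return 0;
}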

The lesson is that there can be large differences in decompression speed and that these differences matter. You ought to benchmark.

What about parsing the uncompressed JSON? We have demonstrated that you can often parse JSON at 3 GB/s or better. I expect that, in practice, you can make JSON parsing almost free compared to compression, disk and network delays.
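
For instance, with a library such as simdjson, which implements the parsing approach described in the paper referenced below, parsing the file takes just a few lines. A minimal sketch, assuming the simdjson library is available and twitter.json is in the current directory:

// Parsing the uncompressed twitter.json with the simdjson library; a minimal
// sketch, not part of the compression benchmark script.
#include "simdjson.h"
#include <cstdint>
#include <iostream>

int main() {
  simdjson::dom::parser parser;
  simdjson::dom::element doc;
  auto error = parser.load("twitter.json").get(doc); // read and parse the file
  if (error) { std::cerr << error << std::endl; return 1; }
  uint64_t count;
  error = doc["search_metadata"]["count"].get(count); // access one field as a sanity check
  if (!error) { std::cout << count << " results" << std::endl; }
  return 0;
}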

Update: This blog post was updated to include Eric Biggers’ libdeflate utility.

Note: There have been many requests to expand this blog post with various parameters and so forth. The purpose of the blog post was to illustrate that there are large performance differences, not to provide a survey of the best techniques. It is simply out of the scope of the current blog post to identify the best approach. I mean to encourage you to run your own benchmarks.

See also: Cloudflare has its own implementation of the algorithm behind gzip. They claim massive performance gains. I have not tested it.

Further reading: Parsing Gigabytes of JSON per Second, VLDB Journal 28 (6), 2019

Science and Technology links (June 26th 2021)

  1. Reportedly, half of us own a smartphone.
  2. It is often reported that women or visible minorities earn less money. However, ugly people do comparatively even more poorly.
  3. We have highly efficient and cheap solar panels. However, they must be disposed of after a number of years, which creates trash. Handling an increasing volume of trash is not cheap:

    The totality of these unforeseen costs could crush industry competitiveness. (…) By 2035, discarded panels would outweigh new units sold by 2.56 times. In turn, this would catapult the [cost] to four times the current projection. The economics of solar — so bright-seeming from the vantage point of 2021 — would darken quickly as the industry sinks under the weight of its own trash.

  4. Ultrasounds may be a viable therapy against Alzheimer’s.
  5. Axolotls regenerate any part of their body, including their spinal cord. Though they only weigh between 60 g and 110 g, they may live 20 years.
  6. Some dinosaurs once thrived in the Arctic.
  7. It was once believed that hair greying was essentially an irreversible process. Researchers looked at actual hair samples and found that in many cases, a hair that once turned gray could regain its colour.
  8. At least as far as your muscles go, you can remain fit up to an advanced age. As supportive evidence, researchers found that human muscle stem cells are refractory to aging:

    We find no measurable difference in regeneration across the range of ages investigated up to 78 years of age.

    Further work shows that you can promote muscle regeneration by reprogramming the cells in muscle fibers, thus potentially activating these muscle stem cells.

  9. Many people who are overweight suffer from the metabolic syndrome, which includes abdominal obesity, high blood pressure, and high blood sugar. It affects about 25% of us worldwide. Researchers have found that a particular hormone, asprosin, is elevated in people suffering from the metabolic syndrome. Suppressing asprosin in animal models reduced appetite and increased leanness, possibly curing the metabolic syndrome. We may hope that this work will quickly lead to human trials.
  10. During the current decade, the share of Canadians who are 65 years and older will rise from 17.4 per cent to 22.5 per cent of the population.
  11. The Internet probably uses less energy than you think, and most projections are pessimistic.