We released simdjson 0.3: the fastest JSON parser in the world is even better!

Last year (2019), we released the simdjson library. It is a C++ library available under a liberal license (Apache) that can parse JSON documents very fast. How fast? We reach and exceed 3 gigabytes per second in many instances. It can also parse millions of small JSON documents per second.

The new version is much faster. How much faster? Last year, we could parse a typical JSON file at a speed of 2.0 GB/s, and then we reached 2.2 GB/s. We are now reaching 2.5 GB/s. Why go so fast? In comparison, a fast disk can reach 5 GB/s and the best network adapters are even faster.

The following plot presents the 2020 simdjson library (version 0.3) compared with the fastest standard-compliant C++ JSON parsers (RapidJSON and sajson). The tests ran on a single Intel Skylake core, and the code was compiled with the GNU GCC 9 compiler. All tests are reproducible using Docker containers.

In this plot, RapidJSON and simdjson both use exact number parsing, while RapidJSON (fast float) and sajson use approximate number parsing. Furthermore, sajson has only partial Unicode validation whereas the other parsers offer exact encoding (UTF-8) validation.

If we had only improved the performance, it would already be amazing. But our new release packs a whole lot of improvements:

  1. Multi-Document Parsing: Read a bundle of JSON documents (ndjson) 2-4x faster than doing it individually.
  2. Simplified API: The API has been completely revamped for ease of use, including a new JSON navigation API and fluent support for both error-code and exception styles of error handling within a single API. In the past, using simdjson was a bit of a chore; the new approach is definitively modern. See for yourself:
    auto cars_json = R"( [
      { "make": "Toyota", "model": "Camry",  "year": 2018, 
           "tire_pressure": [ 40.1, 39.9 ] },
      { "make": "Kia",    "model": "Soul",   "year": 2012, 
           "tire_pressure": [ 30.1, 31.0 ] },
      { "make": "Toyota", "model": "Tercel", "year": 1999, 
           "tire_pressure": [ 29.8, 30.0 ] }
    ] )"_padded;
    dom::parser parser;
    dom::array cars = parser.parse(cars_json).get<dom::array>();
    
    // Iterating through an array of objects
    for (dom::object car : cars) {
      // Accessing a field by name
      cout << "Make/Model: " << car["make"] 
               << "/" << car["model"] << endl;
    
      // Casting a JSON element to an integer
      uint64_t year = car["year"];
      cout << "- This car is " << 2020 - year 
               << " years old." << endl;
    
      // Iterating through an array of floats
      double total_tire_pressure = 0;
      for (double tire_pressure : car["tire_pressure"]) {
        total_tire_pressure += tire_pressure;
      }
      cout << "- Average tire pressure: " 
          << (total_tire_pressure / 2) << endl;
    
      // Writing out all the information about the car
      for (auto [key, value] : car) {
        cout << "- " << key << ": " << value << endl;
      }
    }
    

  3. Exact Float Parsing: simdjson parses floats flawlessly at high speed.
  4. Fallback implementation: simdjson now has a non-SIMD fallback implementation, and can run even on very old 64-bit machines. This means that you no longer need to check whether the system supports simdjson.
  5. Automatic allocation: as part of API simplification, the parser no longer has to be preallocated: it will adjust automatically when it encounters larger files.
  6. Runtime selection API: We have exposed simdjson’s runtime CPU detection and implementation selection as an API, so you can tell what implementation we detected and test with other implementations.
  7. Error handling your way: Whether you use exceptions or check error codes, simdjson lets you handle errors in your style. APIs that can fail return simdjson_result, letting you check the error code before using the result. But if you are more comfortable with exceptions, skip the error code and cast straight to the value you need, and exceptions will be thrown automatically if an error happens. Use the same API either way!
  8. Error chaining: We also worked to keep non-exception error-handling short and sweet. Instead of having to check the error code after every single operation, now you can chain JSON navigation calls like looking up an object field or array element, or casting to a string, so that you only have to check the error code once at the very end.
  9. We now have a dedicated web site (https://simdjson.org) in addition to the GitHub site (https://github.com/simdjson/simdjson).

Credit: many people contributed to simdjson, but John Keiser played a substantial role worthy of mention.

Science and Technology links (March 28th 2020)

    1. In a laboratory, we know how to turn any of our cells into youthful stem cells using something called the Yamanaka factors. If you expose cells to such factors for a short time, they appear to remain functional specialized cells but become more youthful. Researchers demonstrated this approach using cartilage cells from elderly patients suffering from osteoarthritis. They also rejuvenated muscle cells. It now remains to do the same in live human beings.
    2. It has been widely reported that artificial intelligence, and specifically deep learning, can match or surpass clinicians in medical imaging tasks. Unfortunately, it appears that this is far from demonstrated with the necessary rigor:

      Few prospective deep learning studies and randomised trials exist in medical imaging. Most non-randomised trials are not prospective, are at high risk of bias, and deviate from existing reporting standards. Data and code availability are lacking in most studies, and human comparator groups are often small.

    3. Apple released its latest tablet (the iPad Pro) with an integrated LiDAR that can accurately map your environment at up to 5 meters of distance. A LiDAR is basically a laser-based radar. It was famously used by NASA to map the lunar surface in the 1970s, but it was a technology out of reach to most of us twenty years ago: reserved for military and highly specialized applications.
    4. Japan and Korea have more than 13 hospital beds per 1000 people; Spain, Italy, and the U.S. have about 3 beds per 1000 people.
    5. Due to a worldwide pandemic, we are running the largest distance-learning experiment in history. Countless teachers worldwide are teaching online for the first time.
    6. Modern storage devices (such as a USB drive) might be lighter when they are filled with data than when they are empty.
    7. Our smartphones will soon move from 4G networks to 5G networks. The latter are much faster. However, they cause the phones to consume about 20% more energy, according to some reports.
    8. A few decades ago, most of us had computers with a single processor. Over time we acquired processors with two processor cores, each core acting as a processor. Then we got four cores with some cores able to act as if they are made of two or four “logical” processors. The next gaming consoles (e.g., the PS5) will have main CPUs with eight processor cores. It is not difficult to find servers that have well over a hundred logical processors. Yet it appears that Windows was limited to at most 64 logical processors, possibly because the architects of Windows never imagined that we would soon reach this limit.

Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores

Memory is organized in cache lines, frequently blocks of 64 bytes. On Intel and AMD processors, you can store and load memory in blocks of various sizes, such as 64 bits, 128 bits or 256 bits.

In the old days, and on some limited devices today, reading and storing to memory required you to respect alignment. You could not simply write any block of memory anywhere. Today you mostly can write wherever you like. There is a small penalty for misalignment, but the penalty is typically under 10%, as I argued in my 2012 post Data alignment for speed: myth or reality?

Yet writing or reading from two cache lines (what Intel calls a cache line split) instead of a single one is likely to be more expensive at least some of the time. Let us explore an interesting scenario. It is sometimes possible to avoid crossing a cache line boundary by doing two memory accesses instead of a single large one. Is it worth it?

Cache lines in memory are aligned on addresses that are divisible by 64 bytes. Suppose that you want to store 256 bits of data every 64 bytes, at just the right offset so that the 256 bits overlap two cache lines: you hit the last 16 bytes of one cache line and the first 16 bytes of the next one. You can achieve the desired result by starting with an offset of 48 bytes. That is, you find a memory address that is divisible by 64 bytes, and then you add 48 bytes.

In code, using Intel intrinsics, it looks as follows:

char * p = ...
for (size_t i = 0; i < ... ; i++) {
  _mm256_storeu_si256((__m256i *)(p + i * 64), vec);
}

You can entirely avoid crossing the cache line boundary by first storing 128 bits of data at the 48-byte offset, and then storing another 128 bits of data. The first store is at the end of the first cache line and the second store is at the beginning of the second one.

char * p = ...
for (size_t i = 0; i < ... ; i++) {
  _mm_storeu_si128((__m128i *)(p + i * 64), vec);
  _mm_storeu_si128((__m128i *)(p + i * 64 + 16), vec);
}

How do these two approaches fare? I wrote a simple benchmark that stores many blocks of 256 bits at a 48-byte offset. It either stores them in one 256-bit step or in two 128-bit steps. I record the number of cycles per iteration on an AMD Rome processor. I rely on the pre-installed RedHat LLVM compiler (clang version 3.4.2).
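In case it is useful, the core of such a benchmark looks roughly like the following sketch (a simplification, not my exact program, which measures CPU cycles; compile with AVX enabled, e.g., -mavx):

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <immintrin.h>
#include <vector>

int main() {
  const std::size_t blocks = 1 << 20;
  std::vector<char> buffer(blocks * 64 + 128);
  // Find a 64-byte-aligned address, then add the 48-byte offset.
  char *p = buffer.data();
  p += (64 - (reinterpret_cast<std::uintptr_t>(p) % 64)) % 64;
  p += 48;
  __m256i vec256 = _mm256_set1_epi8(1);
  __m128i vec128 = _mm_set1_epi8(1);

  auto t0 = std::chrono::steady_clock::now();
  // One 256-bit store per block: each store crosses a cache line boundary.
  for (std::size_t i = 0; i < blocks; i++) {
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(p + i * 64), vec256);
  }
  auto t1 = std::chrono::steady_clock::now();
  // Two 128-bit stores per block: each store stays within one cache line.
  for (std::size_t i = 0; i < blocks; i++) {
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p + i * 64), vec128);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p + i * 64 + 16), vec128);
  }
  auto t2 = std::chrono::steady_clock::now();
  double ns1 = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
  double ns2 = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
  printf("one 256-bit store : %.2f ns per block\n", ns1 / blocks);
  printf("two 128-bit stores: %.2f ns per block\n", ns2 / blocks);
  printf("check byte: %d\n", (int)p[0]); // keep the stores observable
  return 0;
}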

A single 256-bit write 2.33 cycles
Two 128-bit writes 2.08 cycles

It is a gain of slightly over 10% for the two 128-bit writes. What if you remove the 48-byte offset (or set it to zero)? Then both benchmarks clock in at 2.08 cycles per iteration. I expect that the 48-byte offset is a best-case scenario for the two 128-bit writes: if you change it, then both approaches have the same cache-line overlap problems. So this 10% gain requires you to choose the alignment carefully.

My source code is available. Of course, your results will depend on the processor and, to some extent, on the compiler. You should be able to run my benchmark program on your own Linux x64 system.

Be mindful that if you are getting worse results on a per cycle basis on comparable hardware, you might be limited by your compiler. An analysis of the assembly might be required.

Further reading: Travis Downs has an interesting complementary analysis. He finds that unaligned stores crossing a 32-byte boundary can be tremendously expensive (i.e., 5 cycles) on the type of processor I am using for these tests (Zen 2). The 32-byte boundary exists irrespective of cache lines. Meanwhile, he finds that stores aligned exactly on a 32-byte boundary are much cheaper (1 cycle).

Number of atoms in the universe versus floating-point values

It is estimated that there are about 10^80 atoms in the universe. The estimate for the total number of electrons is similar.

It is a huge number and it far exceeds the maximal value of a single-precision floating-point type in our current computers (which is about 10^38).

Yet the maximal value that we can represent using the common double-precision floating-point type is larger than 10^308. It is an unimaginably large number. There will never be any piece of engineering involving as many as 10^308 parts.

Using a double-precision floating-point value, we can represent easily the number of atoms in the universe. We could also represent the number of ways you can pick any three individual atoms at random in the universe.
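As a quick sanity check, here is a small program (my own illustration) comparing these quantities with the limits of the float and double types:

#include <cstdio>
#include <limits>

int main() {
  double atoms = 1e80; // rough estimate of the number of atoms in the universe
  // Number of ways to pick three distinct atoms: roughly atoms^3 / 6.
  double atom_triples = atoms * atoms * atoms / 6;
  printf("atoms            : %g\n", atoms);
  printf("triples of atoms : %g\n", atom_triples);
  printf("largest float    : %g\n", (double)std::numeric_limits<float>::max());
  printf("largest double   : %g\n", std::numeric_limits<double>::max());
  return 0;
}

The number of atoms overflows a float, but even the number of ways to pick three atoms fits comfortably in a double.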

If your software ever produces a number so large that it will not fit in a double-precision floating-point value, chances are good that you have a bug.

Further reading: Lloyd N. Trefethen, Numerical Analysis, Princeton Companion to Mathematics, 2008


Science and Technology links (March 14th 2020)

    1. Mothers, but not fathers, possess gender-related implicit biases about emotion expression in children.
    2. Chinese researchers used to be offered cash rewards for publishing research articles. The Chinese government has banned such rewards. Increasingly, we are collectively realizing that a single-minded focus on publication numbers ensures systemic waste and failure. This is not unique to the business of research. You do not want to assess programmers by counting the number of lines of code they write, or writers by how many books they have published. The naive application of simple metrics can lead to disastrous results.
    3. A patient’s lung was removed, cleaned of cancer, and put back.
    4. Many major American universities have moved classes online for the rest of the term due to the ongoing pandemic.
    5. Medical doctors keep on practicing many inefficient or harmful therapies, against all scientific evidence. For example, stents are frequently put into people at risk of cardiovascular disease, even though the practice is virtually useless.

Fast float parsing in practice

In our work parsing JSON documents as quickly as possible, we found that one of the most challenging problems is to parse numbers. That is, you want to take the string “1.3553e142” and convert it quickly to a double-precision floating-point number. You can use the strtod function from the standard C/C++ library, but it is quite slow. People who write fast parsers tend to roll their own number parsers (e.g., RapidJSON, sajson), and so we did. However, we sacrifice some standard compliance. You see, the floating-point standard that we all rely on (IEEE 754) has some hard-to-implement features like “round to even”. Sacrificing such fine points means that you can be off by one bit when decoding a string. In practice, this hardly ever matters: double-precision numbers have more accuracy than any engineering project will ever need and a difference in the last bit is irrelevant. Nevertheless, it is mildly annoying.

A better alternative in C++ might be from_chars. Unfortunately, many standard libraries have not yet caught up with the standard and fail to support from_chars properly. One can get around this problem by using the excellent abseil library. It tends to be much faster than the venerable strtod function.

Unfortunately, for our use cases, even abseil’s from_chars is much too slow. It can be two or three times slower than our fast-but-imperfect number parser.

I was going to leave it be. Yet Michael Eisel kept insisting that it should be possible to both follow the standard and achieve great speed. Michael gave me an outline. I was unconvinced. And then he gave me a code sample: it changed my mind. The full idea requires a whole blog post to explain, but the gist of it is that we can attempt to compute the answer, optimistically using a fast algorithm, and fall back on something else (like the standard library) as needed. It turns out that for the kind of numbers we find in JSON documents, we can parse 99% of them using a simple approach. All we have to do is correctly detect the error cases and bail out.
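One well-known fast path of this kind, often used in number parsers, goes as follows (a sketch of my own, not the actual simdjson code; mantissa and exponent10 stand for the digits and the decimal exponent that a parser would have extracted from the string): if the decimal mantissa fits in 53 bits and the decimal exponent lies between -22 and 22, a single multiplication or division by an exactly representable power of ten yields the correctly rounded double.

#include <cstdint>
#include <optional>

// Optimistic fast path: returns the correctly rounded double when the input
// is "easy", and std::nullopt when we must bail out to a slower, fully
// standard-compliant routine (such as strtod).
std::optional<double> fast_path(uint64_t mantissa, int64_t exponent10, bool negative) {
  // Powers of ten up to 10^22 are exactly representable as doubles.
  static const double power_of_ten[] = {
      1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10, 1e11,
      1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22};
  if (mantissa >= (uint64_t(1) << 53) || exponent10 < -22 || exponent10 > 22) {
    return std::nullopt; // the slow path must handle this number
  }
  double d = double(mantissa); // exact, since the mantissa fits in 53 bits
  d = (exponent10 >= 0) ? d * power_of_ten[exponent10] : d / power_of_ten[-exponent10];
  return negative ? -d : d;
}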

Your results will vary, but the next table gives the speed numbers from my home iMac (2017). The source code is available along with everything necessary to test it out (Linux and macOS).

parser MB/s
fast_double_parser (new) 660 MB/s
abseil, from_chars 330 MB/s
double_conversion 250 MB/s
strtod 70 MB/s

Science and Technology links (March 7th 2020)

  1. The benefits of flu vaccines in healthy adults are modest. They do not reduce neonatal deaths, hospitalisations, or working days lost. They do not seem more helpful in older adults either.
  2. While in 1970, only 1% of the American population was severely obese, the percentage of obesity is now at 9%.
  3. Older adults who volunteer and stay active maintain a larger brain.
  4. The election of Donald Trump in 2016 led to slightly fewer baby boys in Canada. Stress tends to favor female births.
  5. The plague is still killing people in Madagascar. The disease is carried by rats.
  6. The social cost of carbon has been estimated to be as high as $60 a tonne, meaning that if we spent $60 to remove the production of a tonne of carbon, we would break even. Some of the most recent estimates are much lower: between $0.60 and $3. These new estimates take into account the benefits of CO2 such as higher plant productivity.

Will calling “free” or “delete” in C/C++ release the memory to the system?

In the C programming language, we typically manage memory manually. A typical heap allocation is a call to malloc followed by a call to free. In C++, you have more options, but it is the same routines under the hood.

// allocate N kB
char *data = malloc(N * 1024);
// do something with the memory
// ...
// release the memory
free(data);

It stands to reason that if your program just started and the value of N is large, then the call to malloc will increase memory usage by about N kilobytes. And indeed, it is the case.

So what is the memory usage of your process after the call to “free”? Did the N kilobytes return to the system?

The answer is that, in general, it is not the case. I wrote a small program under Linux that allocates N kilobytes and then frees them. It will then measure the RAM usage after the call to free. The exact results will depend on your system, standard library and so on, but I give my results as an illustration.

As you can observe in the table, the memory does sometimes get released, but only when it is a large block of over 30 MB in my tests. It is likely because in such cases a different code path is used (e.g., calling mmap, munmap). Otherwise, the process holds on to its memory, never again releasing it to the system.

memory requested memory usage after a free
1 kB 630 kB
100 kB 630 kB
1000 kB 2000 kB
10,000 kB 11,000 kB
20,000 kB 21,000 kB
30,000 kB 31,000 kB
40,000 kB 1,200 kB
50,000 kB 1,300 kB
100,000 kB 1,300 kB

Of course, there are ways to force the memory to be released to the system (e.g., malloc_trim may help), but you should not expect that it will do so by default.
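For instance, with the GNU C library, a minimal sketch looks as follows (malloc_trim is a glibc extension, not standard C or C++):

#include <cstdlib>
#include <cstring>
#include <malloc.h> // glibc-specific header declaring malloc_trim

int main() {
  // Allocate and touch a 10 MB block, then free it: glibc will typically
  // keep this memory in its free lists rather than return it to the system.
  char *data = static_cast<char *>(std::malloc(10000 * 1024));
  std::memset(data, 1, 10000 * 1024);
  std::free(data);
  // Ask glibc to return unused heap memory to the operating system.
  malloc_trim(0);
  return 0;
}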

Though I use C/C++ as a reference, the exact same effect is likely to occur in a wide range of programming languages.

What are the implications?

  • You cannot easily measure the memory usage of your data structures using the amount of memory that the process uses.
  • It is easy for a process that does not presently hold any data to appear to be using a lot of memory.

Further reading: glibc malloc inefficiency

Science and Technology links (February 29th 2020)

  1. No one really understands how planes fly. This puts a dent in the model whereby inventions follow theory.
  2. Physician salaries and diagnostic tests account for 4% of GDP in the US.
  3. A supplement (NMN) available to human beings has been used in mice: it restores fertility in old female mice.
  4. The investment firm Andreessen Horowitz is critical of artificial intelligence start-ups. It finds that they have low margins due to the cost of training models using many computers in the cloud. They may fail to scale due to the difficulty of generalizing use cases. And they are subject to commodification. That is, “artificial intelligence” as a business proposition may be far less profitable than typical “software” ventures.
  5. The most profitable businesses tend to be started by older people who have lots of experience in a specific industry.
  6. Dogs can detect heat sources from as far away as 1.6 meters with their nose.
  7. Cancer patients received genetically edited immune cells (using CRISPR). It turned out well.

Fast divisionless computation of binomial coefficients

Suppose that I give you a set of n objects and I ask you to pick k distinct objects, thus forming a new subset. How many such subsets are there? If you have taken college mathematics, you have probably learned that the answer is given by binomial coefficients. They are defined as n!/(k! * (n-k)!) where the exclamation point refers to the factorial. For a programmer, it is easier to think of binomial coefficients as the result of this loop:

def binom(n,k):
    top = 1
    bottom = 1
    for i in range(1, k + 1):
        top *= n - i + 1
        bottom *= i
    return top // bottom

Though simple enough, this algorithmic definition is not practical if you are interested in performance. Both the numerator and the denominator grow large quickly. They will soon require several machine words. A programming language like Python will happily give you the correct answer, but slowly. In Java or C, you are likely to get the wrong result due to a silent overflow.

Of course, if you know that the binomial coefficient is too large to fit in a machine word (64 bits), then you may as well go to a big-integer library. But what if you know that the result fits in a machine word? Maybe you have a reasonable bound on the size of n.

Then instead of waiting at the very end before doing the division, you can divide at each iteration in the loop:

def binom(n,k):
    answer = 1
    for i in range(1, k + 1):
        answer = answer * (n - k + i) // i
    return answer

This new approach may still overflow even if the binomial coefficient itself fits in a machine word because we multiply before dividing. You can get around this issue by first finding a common divisor to both the multiplier and the divisor, and factoring it out. Or else, you can further restrict the values of n and k. Let us choose this last path.

We still have a problem: we need k-1 multiplications and divisions. The multiplications are relatively cheap, but the divisions have longer latencies. We would prefer to avoid divisions entirely. If we assume that k is small, then we can use the fact that a division by a known value can always be replaced with a shift and a multiplication. All that is needed is to precompute the shift and the multiplier. If there are few possible values of k, we can precompute them with little effort.

Hence, if you know that, say, n is smaller than 100 and k smaller than 10, the following function will work…

uint64_t fastbinomial(int n, int k) {
  uint64_t np = n - k;
  uint64_t answer = np + 1;
  for (uint64_t z = 2; z <= (uint64_t)k; z++) {
    answer = answer * (np + z);
    // Exact division by z: shift out the powers of two, then multiply by
    // the 64-bit modular inverse of the odd part of z.
    fastdiv_t f = precomputed[z];
    answer >>= f.shift;
    answer *= f.inverse;
  }
  return answer;
}
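The precomputed table does the heavy lifting. Here is a sketch of how such entries could be computed (my own illustration; the details in my actual implementation may differ): the shift counts the factors of two in the divisor, and the inverse is the 64-bit modular inverse of its odd part, obtained with a few Newton iterations. I use a GCC/Clang builtin to count trailing zeros.

#include <stdint.h>

typedef struct {
  uint64_t inverse; // modular inverse (mod 2^64) of the odd part of the divisor
  int shift;        // number of trailing zero bits in the divisor
} fastdiv_t;

// Computes the inverse of an odd 64-bit integer modulo 2^64.
static uint64_t modular_inverse64(uint64_t d) {
  uint64_t x = d; // correct to 3 bits, since d * d == 1 (mod 8) for odd d
  for (int i = 0; i < 5; i++) {
    x *= 2 - d * x; // each Newton iteration doubles the number of correct bits
  }
  return x;
}

// Builds the entry used to divide exactly by d (for d > 0).
static fastdiv_t compute_fastdiv(uint64_t d) {
  fastdiv_t f;
  f.shift = __builtin_ctzll(d);                // powers of two, handled by a shift
  f.inverse = modular_inverse64(d >> f.shift); // odd part, handled by a multiplication
  return f;
}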

I provide a full portable implementation complete with some tests. Though I use C, it should work as-is in many other programming languages. It should only take tens of CPU cycles to run. It is going to be much faster than implementations relying on divisions.

Another trick that you can put to good use is that the binomial coefficient is symmetric: you can replace k by n-k and get the same value. Thus if you can handle small values of k, you can also handle values of k that are close to n. That is, the above function will also work for n smaller than 100 and k larger than 90, if you just replace k by n-k.

Is that the fastest approach? Not at all. Because n is smaller than 100 and k smaller than 10, we can precompute (memoize) all possible values. You only need an array of 1000 values. It should fit in 8kB without any attempt at compression. And I am sure you can make it fit in 4kB with a little bit of compression effort. Still, there are instances where relying on a precomputed table of several kilobytes and keeping them in cache is inconvenient. In such cases, the divisionless function would be a good choice.
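For illustration, here is how such a table could be filled once using Pascal’s rule (a sketch of my own, not the code I benchmarked):

#include <cstdint>

// Precomputed binomial coefficients C(n, k) for n < 100 and k < 10:
// 1000 64-bit values, i.e., about 8 kB.
static uint64_t binomial_table[100][10];

void build_table() {
  for (int n = 0; n < 100; n++) {
    binomial_table[n][0] = 1; // C(n, 0) = 1
    for (int k = 1; k < 10; k++) {
      // Pascal's rule: C(n, k) = C(n-1, k-1) + C(n-1, k), with C(0, k) = 0 for k > 0.
      binomial_table[n][k] =
          (n == 0) ? 0 : binomial_table[n - 1][k - 1] + binomial_table[n - 1][k];
    }
  }
}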

Alternatively, if you are happy with approximations, you will find floating-point implementations.

Science and Technology links (February 22nd 2020)

    1. In a large cohort study, the highest probability of reaching 90 years old was found for those drinking between 5g and 15 g of alcohol per day. This does not mean that if you are not drinking, you should start.
    2. The Earth is getting greener thanks to CO2. In turn, a greener Earth will mitigate global warming. (Source: Nature)
    3. In 2019, the carbon emissions in the US fell by 2.9%. They fell by 5% in the European Union. They also fell in Japan. (Source: Bailey)
    4. Robots can take blood samples and apparently do a competent job, according to a clinical trial.
    5. We may soon get 80-terabyte disk drives.
    6. The age-adjusted cancer rate in the US is currently about at the same level as it was in 1930. We are not winning the war against cancer.
    7. You are better off cutting your food on wooden planks: they are more hygienic than plastic ones.
    8. Science is undergoing what some people call the “reproducibility crisis”: many important results reported in prestigious venues cannot be reproduced by other scientists, independently. Miyakawa suggests that the reproducibility crisis might be related to the fact that studies are frequently fabricated:

      (…) more than 97% of the 41 manuscripts did not present the raw data supporting their results when requested by an editor, suggesting a possibility that the raw data did not exist from the beginning.

      A few years ago, I was on the PhD committee of a student. I questioned the results. Ultimately, we asked for the software that produced that data. The student quickly reported that the software had been lost, deleted by the University. We declined to grant the PhD despite an extensive publication record (with articles in some of the best venues). I paid a political price for my choice to fail the student. The student eventually did get a PhD after an appeal. I would not be surprised to learn that this student became a professor. The lesson is that you should always doubt scientific studies. Ask that they be independently reproduced.

My thoughts on how research funding is attributed in Canada to Computer Science

In Canada, most computer science professors seek funding with NSERC, the main Canadian funding agency for science and engineering. It is more or less the equivalent of the American NSF. The core NSERC program is called “discovery” and it funds 5-year research programs. So, roughly speaking, currently funded professors apply for funding about once every 5 years. Once funded, you do not have to pursue the program you proposed: we recognize that you cannot be expected to stupidly stay on the same course for five years in a fast moving field.

Applicants get a rating on the excellence of the researcher, the merit of their proposal, and how well they are training students. It is an all-around evaluation. It is quite stressful.

Not all professors are funded or seek funding. However, it is common for computer science departments to expect their new faculty to seek funding. At many places, getting funding is a necessary condition to get tenure. I would expect it to be the case at all top departments. In effect, the NSERC discovery program acts as a Canada-wide peer review process.

The grants are modest: from about 20k$ a year to 100k$. Very few people get 100k$ a year; you basically have to be a Turing award recipient. So it is not a lot of money compared to what American professors get. In most cases, all the money goes to students. Professors cannot pay themselves, so getting a grant does not increase your salary.

In computer science, applications go to a committee made up of between 40 and 50 people. Most are from Canadian universities, but there are some people from industry and from other countries. Each application is reviewed by a team of five committee members, supervised by a chair (who is also a member of the committee) as well as at least one program officer. There are also external reviews, which are taken seriously, but they are just one element among many. The applicants must provide samples of their research; the committee members browse and discuss these papers. And there is a 5-page proposal describing the science that the applicant wants to pursue.

I just finished a term as co-president of the committee. It is a lot of work. I could probably have written five more papers these last three years without this service responsibility. Let me add that it is unpaid.

Here are my takeaways from the experience:

  1. We often hear that science is all about publishing lots and lots of papers. That is definitively not true. Once you put a bunch of professional computer scientists in a room and you tell them to judge someone… they quickly dismiss sheer volume. They seek quality. They seek actual science. They also tend to go with proxies, as in “they published in the best venue”. Yet, even there, it is not so much the volume that matters as the fact that specific journals and conferences are especially selective. Committee members are eager for explanations as to why the research is great; it is often the applicants themselves who are not forthcoming about details. If you wrote about a breakthrough in an online note or presented it at a local workshop, your peers will be happy to consider it, though you have a bit more explaining to do than if it appeared in a prestigious venue. And it is definitively possible to get the highest ratings without a pursuit of prestigious venues.
  2. We take services to the community and to society very seriously. It is not all about papers.
  3. I don’t think bibliometrics like citations get discussed much at all: again, it is not about numbers. Being highly cited can be good, but it is not the end game.
  4. It is surprising how even the most experienced researchers sometimes cannot competently describe a research proposal. Sometimes there are goals, but no means to achieve them. Sometimes the objectives are all over the map and incoherent. People will mix and match ideas in the most absurd manner.
  5. The committee is quite open to new and bold ideas. In fact, it rewards bold ideas if they are presented in a competent manner.
  6. Year after year, I have seen “old ideas” being praised when put into a solid program. Not everything good has to be about bitcoins and deep learning. That is, it is not required that you work on fashionable ideas. The committee has a lot of tolerance for unfashionable ideas.
  7. People who try to “pad” their resume to look impressive take risks. Committee members get annoyed very quickly at people gaming the system.
  8. Everyone gets assessed on the same grid. It does not matter whether you are at the best or worst university. It does not matter if you are 20 years old or 70 years old. Evidently, it does not matter whether you are a man or not. So it is a system that is hard on younger, less experienced professors. It is also hard for people from small institutions. If you are both inexperienced and from a small institution, it is especially difficult. The counterpart is that if you do well while being at a small institution, you should be especially proud of yourself.
  9. Unfortunately, as with every year, many professors will get bad news. They will have failed to get funding. This may mean that they will soon lose their job. Some people are just bad at what they do or they have bad ideas: it is a service to tell them loud and clear. But most people who fail to receive funding are not bad. In many cases, the merit of their work was clear. This is not a perfect system, nor can it be. Some people simply have not yet had the means to reach the thresholds set by the grid. Some people do work that just does not fit well with the grid. This makes me sad and slightly depressed, but there is no easy fix. In my department, we specifically do not require that new professors get a discovery grant to receive tenure. We do our own assessment.
  10. It is definitively not all about politics. I have heard rumours about people from a given school of thought trying to sink people from another school, or people trying to advocate for their own speciality. I am sure it happens. However, when it is observed, it does get some attention. Nothing is ever perfect, but we don’t let politics take over.

All in all, I feel better about my peers and about computer science after this experience at NSERC. I am generally very concerned about quality in research. There is a lot of junk out there. Yet there are also a lot of good people trying to do good work. My expectation is that the Canadian system is probably one of the best in the world because it puts quality and good science first.


Further reading: Ebrahim Bagheri was a fellow committee member and he wrote about his experience on twitter.

Science and Technology links (February 8th 2020)

  1. It is often believed that radiation is bad for you. To the contrary, David et al. report that life expectancy is approximately 2.5 years longer in people living in areas with an elevated background radiation in the USA.
  2. Birth order, that is whether you are the oldest or youngest sibling, is independently associated with a number of health and performance metrics. The first born is usually more fit and stable. Lin et al. argue that the main explanation might be that younger siblings are more likely to have been unwanted.
  3. The University of California has completely cancelled its subscription to research papers by Elsevier. Elsevier is one of the most important scientific publishers. It is also a for-profit publisher.
  4. Low levels of Low-density lipoprotein cholesterol (LDL-C), often qualified as “bad cholesterol”, are strongly and independently associated with increased risk of cancer, cardiovascular diseases, and all-cause mortality according to a Korean study made of hundreds of thousands of human subjects. This new study puts into question mainstream beliefs about “bad cholesterol”.
  5. Education is strongly associated with better health and better longevity. However, after controlling for income and living conditions, the relationship between health and education evaporates.
  6. Harvard’s George Church has created a startup that will use an anti-aging gene therapy to increase the longevity of dogs. It is based on previous work done on mice and reported in a paper entitled A single combination gene therapy treats multiple age-related diseases (source: PNAS).
  7. As we age, 90% of us will get gray hair. It is often believed to be an irreversible process. Researchers at the University of Alabama found strong evidence that it is not the case and they believe that hair graying can be reversed. They are launching a company to develop the therapy. Note that there is documented but anecdotal evidence for gray-hair reversal, e.g., in cancer patients.

Research should not stop with the research paper

The practice of academic research is based on the production of formal documents that undergo formal review by peers. We routinely evaluate academics for jobs and promotions based on their publication output. When asked about their contribution to science, many academics are happy to point at their papers. In some cases, they will also point at the grants that they have secured.

Back when I worked at NRC, a friend of mine, Martin Brooks, gave a controversial talk entitled “user-based research”. It has been nearly 20 years, and I still remember this talk. His ideas were so upsetting that people left while he was talking. I stayed and listened. It took me years to own the ideas he expressed that day.

Martin complained that researchers mostly “threw research over the wall” expecting that other people (maybe industry) would pick up the research from the other side of the wall. While it certainly can happen that others will read your paper and apply it, you should not count on it. Nobody has to read and understand your work.

When I was a bit younger, a senior professor pointed out that some idea we were discussing had been described and tested by his team in some prestigious venue decades ago, but that nobody ever did anything with it. Without thinking, I replied: it was your job to do something with it, don’t blame others. The room fell silent and there was a long pause.

I am not saying that if you find a way to cure cancer in mice, it is your job to get the therapy cleared for use with human beings and to open a hospital where the therapy is delivered. A single individual can only do so much.

What I am saying, however, is that publishing a research paper is not the goal of research. It is not the final output. It is only one element in a chain. And it is not even a requirement. You need to communicate your idea, but the peer-reviewed journal article is just one such means.

Your actual goal is “transfer”. That is, someone, somewhere, must put your ideas in practice beyond the publication of your paper. It does not mean “industrial applications” though it can be that. If your idea is worthwhile and you let it end with a research paper, you have failed.

And it does not merely mean collecting “citations” or other bibliometrics. People routinely cite papers without reading them. Few citations are influential.

But the academic incentives almost conspire to prevent impactful research. There is one specific criterion that academics like to apply that is destructive: novelty. For some piece of academic work to be judged worthwhile, it must be novel. I will say it again and again: originality is overrated.

Of course, people entering a new field tend to “rediscover known facts”. You have to push them back and tell them to go study some more. But there is a difference between naïveté and lack of originality. You have to be aware of the history of your field; that is what scholarship is all about. But you also have to stick with ideas for the long run, until the fruits appear.

Instead of rewarding novelty, we should reward scholarship: we should ask people to show that they have studied and documented the past. We should never penalize someone who works on pushing a known idea by refining it, communicating it, validating it.

This idea that some venerable professor had 20 years ago that never went anywhere? Well, it might be entirely worthwhile to revisit it and make it go somewhere, even if it is not novel at all.

More: This blog post is the subject of a live interview with me on YouTube.

Further reading: Boote, D. N., & Beile, P. (2005). Scholars before researchers: On the centrality of the dissertation literature review in research preparation. Educational researcher, 34(6), 3-15.

Science and Technology links (February 1st 2020)

  1. Almost all climate predictions are based on the so-called “business as usual” model, yet this model is based on assumptions that are unrealistically pessimistic. (Source: Nature)
  2. African populations have Neanderthal ancestry. This might be indicative that people from Europe went back to Africa over the centuries.
  3. The oceans are not acidic; their pH is about 8.0. (Source: Forbes)
  4. As we age, we tend to accumulate “senescent cells”: these are cells that should be dead but somehow linger around. We know how to kill them selectively with some success. Recent research suggests that brain neurodegeneration might be caused by senescent brain cells. Neurons themselves do not become senescent, as far as I know, but supporting cells like astrocytes do. This suggests that we may improve brain health in older people with therapies to remove senescent cells. Similarly, senescent cells might cause osteoporosis, the phenomenon whereby the bones of older people become brittle.
  5. Mitochondria are small organelles that live in our cells and produce our energy. We have now learned that mitochondria are also found in the blood stream. It is unclear what purpose they serve there, if any.
  6. How frequently a couple kisses is a strong predictor of relationship satisfaction.
  7. Researchers have transplanted stem cells as “cell sheets” on the heart of human patients. The hope is that these sheets contribute to repair the heart of the patients.
  8. Most of us at some point suffer from atherosclerosis, the accumulation of plaques in our arteries. Researchers developed a nanotherapy that helps reduce atherosclerosis in mice.

Cost of a thread in C++ under Linux

Almost all our computers are made of several processing cores. Thus it can be efficient to “parallelize” expensive processing in a multicore manner. That is, instead of using a single core to do all of the work, you divide the work among multiple cores. A standard way to approach this problem is to create threads.

A C++ thread object executes some functions, possibly on a thread created by the operating system, and goes away. If you wanted to increment a counter using a C++ thread, you could do it in this manner:

auto mythread = std::thread([] { counter++; });
mythread.join();

It is obviously silly code that you should never use as is; it is a mere illustration. Creating a new thread is not free. Exactly how expensive it might be depends on several parameters. But can we get some rough idea?

For this purpose, I wrote a small benchmark where I just create a thread, increment a counter and let the thread die. I measure the time elapsed while waiting for the thread to run its course. My program computes the mean and standard error of the time, as well as the minimum and maximum duration of the test. For simplicity, I am just going to report the means:

system time per thread
Ampere server (Linux, ARM) 200,000 ns
Skylake server (Linux, Intel) 9,000 ns
Rome server (Linux, AMD) 20,000 ns
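The measurement loop looks roughly like the following sketch (a simplification: my actual benchmark also computes the standard error, minimum and maximum):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
  std::atomic<int> counter{0};
  const int iterations = 1000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; i++) {
    // Create a thread that merely increments the counter, then wait for it to finish.
    std::thread t([&counter] { counter++; });
    t.join();
  }
  auto stop = std::chrono::steady_clock::now();
  double ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
  printf("%.0f ns per thread (counter = %d)\n", ns / iterations, counter.load());
  return 0;
}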

I am deliberately not going into the details of the compiler, system library, operating system, RAM and all that fun stuff. You should not look at my table and make far reaching conclusions.

What is clear, however, is that creating a thread may cost thousands of CPU cycles. If you have a cheap function that requires only hundreds of cycles, it is almost surely wasteful to create a thread to execute it. The overhead alone is going to set you back.

There are clever ways to amortize the cost of a thread. For example, you may avoid the constant creation of new threads as in my benchmark. Yet to amortize, you still need to have enough work to make it worthwhile.

Science and Technology links (January 25th 2020)

  1. Scientists found a way to increase the production of new neurons in the brains of mice, effectively rejuvenating the brains of old mice. (Source: Nature)
  2. How many people died during accidents at nuclear power plants? In North America, the most famous accident happened at Three Mile Island when a reactor overheated, shut down, and a small amount of harmless radioactive gas was released. Nobody was harmed.
    In the recent Fukushima accident in Japan, about 50 people died directly as a result of the event. Many more people died following the event, but they did not die due to radiation or exposure to radioactive material. Instead, they mostly died due to the evacuation process, and most of the deaths were elderly or sick people for whom the evacuation was traumatic. It is estimated that infants in the region will see their cancer risk raised by 1% over their lifetime. In comparison, the nuclear accident in Chernobyl was more tragic: thousands of people probably died as a result. Should you conclude that nuclear energy is risky? In the seventies, a dam collapsed and killed an estimated 230,000 people. Thus hydro-electric power is far more deadly than nuclear power, when just accounting for this one incident. And, of course, coal power plants are far more deadly than either hydro or nuclear. So we are irrational with respect to our fear of nuclear power. Speaking for myself, I would gladly live and work in a nuclear power plant, though maybe not one designed by the USSR. Furthermore, we now have the technology to build small nuclear reactors that would be orders of magnitude safer, if only we allowed it.
  3. Academic researchers continue to fail to report the results of clinical trials in the USA, breaking the law in the process. This can be verified since clinical trials are public events, posted on the Web with the name of the researchers. It would be trivial to identify by name the academic researchers who fail to report the results of their clinical trial. These same clinical trials are typically funded by the government. The lack of compliance has been reported for years.
  4. Japan has 70,000 centenarians.
  5. Topical application of a drug commonly used as an immunodepressor (Rapamycin) appears to rejuvenate the skin of older people.
  6. Researchers were able to regenerate the optic nerve of mice after crushing it. It is believed to be an indication that we could reverse age-related vision loss due to glaucoma or other diseases.
  7. Exercise keeps you healthier, but how the benefits work is a bit unclear. Researchers found that by genetically removing a family of proteins (Sestrins) from flies and mice, they were able to negate the benefits of exercise. In contrast, increasing the production of Sestrins appears to emulate the benefits of exercise. (Source: Nature)
  8. Researchers claim to have found a new form of immunotherapy that would be effective against all cancers. However, it is at an early stage and clinical trials are not even being considered at this time.

Filling large arrays with zeroes quickly in C++

Travis Downs reports that some C++ compilers have trouble filling up arrays with values at high speed. Typically, to fill an array with some value, C++ programmers invoke std::fill. We might assume that for a task as simple as filling an array with zeroes, the C++ standard library would provide the absolute best performance. However, with GNU GCC compilers, that is not the case.

The following line of C++ code gets compiled as a loop that fills each individual byte with zeroes when applying the conventional -O2 optimization flag.

std::fill(p, p + n, 0);

When the array is large, this can be inefficient compared to a highly optimized implementation like the C function memset.

memset(p, 0, n);

Travis reports that the difference is a factor of thirty. He also explains how to fix the C++ code so that it is as fast as the memset function.
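As I understand the fix, the key is to pass a fill value whose type matches the byte type being filled, so that the standard library can dispatch to memset; see his post for the full analysis. A sketch:

#include <algorithm>
#include <cstddef>

void fill_zero_slow(char *p, std::size_t n) {
  // The literal 0 has type int: with GCC at -O2, this ends up as a byte-by-byte loop.
  std::fill(p, p + n, 0);
}

void fill_zero_fast(char *p, std::size_t n) {
  // A char-typed zero lets the library lower the fill to a call to memset.
  std::fill(p, p + n, '\0');
}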

What about very large arrays? What if you need to write zeroes all over a 2GB array?

memset 30 GB/s
std::fill 1.7 GB/s

Roughly speaking, the memset function is 15 times faster than std::fill in my test. This demonstrates that even if you are working with large volumes of data, you can easily be limited by an inefficient implementation. You are not automatically limited by your memory bandwidth. In fact, std::fill in this instance cannot even keep up with a good network adapter or a fast disk. We routinely parse JSON data, and write out DOM trees, at speeds far higher than 1.7 GB/s.

Another lesson I would derive from this example is that writing your code in idiomatic C++ is not sufficient to guarantee good performance. There is a lot of engineering involved in making C++ fast, and though it often works out magically, it can fail too.

My source code is available.

Allocating large blocks of memory: bare-metal C++ speeds

In a previous post, I benchmarked the allocation of large blocks of memory using idiomatic C++. I got a depressing result: the speed could be lower than 2 GB/s. For comparison, the disk in my laptop has greater bandwidth.

Methodologically, I benchmarked the “new” operator in C++ with initialization, using the GNU GCC compiler with the -O2 optimization flag1.

char *buf = new char[s]();

It turns out that you can do better while sticking with C++. We cannot simply call the new operator without initialization because, in general, it does not result in the memory being actually allocated. However, we can allocate the memory and then make sure that we touch every “page” of memory. On modern Intel systems, pages are effectively always at least 4kB in size, so we can touch the memory every 4kB. We might call this approach “new and touch”.

char * buf = new char[size];
for (size_t i = 0; i < size; i += 4096) buf[i] = 0;
buf[size - 1] = 0;

Such a new-and-touch strategy should be close to “bare-metal” memory allocation speeds. So how fast is it? It depends on the page size. By default, most systems rely on small (4kB) pages. Allocating many small pages is expensive. Thankfully, Linux can be configured to use huge pages, transparently, via a feature called “transparent huge pages”. And it makes a huge difference!

Allocating 512MB Setting 512MB to zero
regular pages (4kB) 3.9 GB/s 30 GB/s
transparent huge pages 20 GB/s 30 GB/s

I am using a recent Linux system (Ubuntu 16.04), a Skylake processor and GNU GCC 8.3 with the -O2 optimization flag. My source code is available.

It is still the case that allocating memory on most systems is a non-trivial cost since they rely on small 4kB pages. There are fast disks available on the market that have more than 4 GB/s of bandwidth.

Credit: Thanks to Travis Downs and others for their insights and comments.


1 Downs found that we get far better performance out of the new operator with initialization under GNU GCC with the more aggressive -O3 optimization flag. Performance-wise, it should be close to the “new and touch” approach that I am describing.

How fast can you allocate a large block of memory in C++?

In C++, the most basic memory allocation code is just a call to the new operator:

char *buf = new char[s];

According to a textbook interpretation, we just allocated s bytes1.

If you benchmark this line of code, you might find that it is almost entirely free on a per-byte basis for large values of s. But that is because we are cheating: the call to the new operator “virtually” allocates the memory, but you may not yet have actual memory that you can use. As you access the memory buffer, the system may then decide to allocate the memory pages (often in blocks of 4096 bytes). Thus the cost of memory allocation can be hidden. The great thing with a virtual allocation is that if you never access the memory, you may never pay a price for it.

If you actually want to measure the memory allocation in C++, then you need to ask the system to give you s bytes of allocated and initialized memory. You can achieve the desired result in C++ by adding parentheses after the call to the new operator:

char *buf = new char[s]();

Then the operating system actually needs to allocate and initialize memory2. It may still cheat in the sense that it may recycle existing blocks of memory or otherwise delay allocation. And I expect that it might do so routinely if the value of s is small. But it gets harder for the system to cheat as s grows larger.

What happens if you allocate hundreds of megabytes in such a manner? The answer depends on the size of the pages. By default, your system probably uses small (4kB) pages. Under Linux, you can enable “transparent huge pages” which dynamically switches to large pages when large blocks of memory are needed. Using larger pages means having to allocate and access fewer pages, so it tends to be cheaper.

In both instances, I get around a couple of gigabytes per second on a recent Linux system (Ubuntu 16.04) running a conventional Skylake processor. For comparison, you can set memory to zero at tens of gigabytes per second and my disk can feed data to the system at more than 2 GB/s. Thus, at least on the system I am currently using, memory allocation is not cheap. My code is available; I use GNU GCC 8.3 with the -O2 optimization flag.

Allocating 512MB Setting 512MB to zero
regular pages (4kB) 1.6 GB/s 30 GB/s
transparent huge pages 2.4 GB/s 30 GB/s

You can do better with different C++ code, see my follow-up post Allocating large blocks of memory: bare-metal C++ speeds.

Further remarks. Of course, you can reuse the allocated memory for greater speeds. The memory allocator in my standard library could possibly do this already when I call the new operator followed by the delete operator in a loop. However, you still need to allocate the memory at some point, if only at the beginning of your program. If your program needs to allocate 32 GB of memory, and you can only do so at 1.4 GB/s, then your program will need to spend 23 seconds on memory allocation alone.

Footnotes:

  1. Several readers have asked why I am ignoring C functions like calloc, malloc and mmap. The reason is simple: this post is focusing on idiomatic C++.
  2. You might wonder why the memory needs to be initialized. The difficulty has to do with security. Though it is certainly possible for the operating system to give you direct access to memory previously used by another process, it cannot also pass along the data. Thus it needs to erase the data. To my knowledge, most systems achieve this result by zeroing the new memory.