Fast divisionless computation of binomial coefficients

Suppose that I give you a set of n objects and I ask you to pick k distinct objects, thus forming a new subset. How many such subsets are there? If you have taken college mathematics, you have probably learned that the answer is given by binomial coefficients. They are defined as n!/(k! * (n-k)!) where the exclamation point refers to the factorial. For a programmer, it is easier to think of binomial coefficients as the result of this loop:

def binom(n,k):
    top = 1
    bottom = 1
    for i in range(1, k + 1):
        top *= n - i + 1
        bottom *= i
    return top // bottom

Though simple enough, this algorithmic definition is not practical if you are interested in performance. Both the numerator and the denominator grow large quickly. They will soon require several machine words. A programming language like Python will happily give you the correct answer, but slowly. In Java or C, you are likely to get the wrong result due to a silent overflow.

Of course, if you know that the binomial coefficient is too large to fit in a machine word (64 bits), then you may as well go to a big-integer library. But what if you know that the result fits in a machine word? Maybe you have a reasonable bound on the size of n.

Then instead of waiting at the very end before doing the division, you can divide at each iteration in the loop:

def binom(n,k):
    answer = 1
    for i in range(1, k + 1):
        answer = answer * (n - k + i) // i
    return answer

This new approach may still overflow even if the binomial coefficient itself fits in a machine word because we multiply before dividing. You can get around this issue by first finding a common divisor to both the multiplier and the divisor, and factoring it out. Or else, you can further restrict the values of n and k. Let us choose this last path.

We still have a problem: we need k-1 multiplications and divisions. The multiplications are relatively cheap, but the divisions have longer latencies. We would prefer to avoid divisions entirely. If we assume that k is small, then we can use the fact that a division by a known value can always be replaced by a shift and a multiplication. All that is needed is to precompute the shift and the multiplier. If there are few possible values of k, we can precompute them with little effort.

Hence, if you know that, say, n is smaller than 100 and k smaller than 10, the following function will work…

uint64_t fastbinomial(int n, int k) {
  uint64_t np = n - k;
  uint64_t answer = np + 1;
  for (uint64_t z = 2; z <= (uint64_t)k; z++) {
    answer = answer * (np + z); 
    fastdiv_t f = precomputed[z];
    answer >>= f.shift;
    answer *= f.inverse;
  }
  return answer;
}
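
The full implementation provides fastdiv_t and the precomputed table. As a rough sketch of how such a table might be built (the struct layout, function names and table size below are my own guesses, not necessarily those of the actual code): dividing by z, when the division is known to be exact, amounts to shifting out the trailing zero bits of z and then multiplying by the 64-bit modular inverse of its odd part. The division is exact here because, at each step, the running value is z times a binomial coefficient.

#include <cstdint>

// hypothetical helper type and table for the snippet above
struct fastdiv_t {
  uint64_t inverse; // 64-bit modular inverse of the odd part of z
  uint64_t shift;   // number of trailing zero bits in z
};

// Newton's iteration: each step doubles the number of correct bits,
// starting from 3 correct bits (odd * odd = 1 mod 8)
uint64_t modular_inverse(uint64_t odd) {
  uint64_t x = odd;
  for (int i = 0; i < 5; i++) x *= 2 - odd * x;
  return x;
}

fastdiv_t precomputed[11]; // entries 0 and 1 are unused

void init_precomputed() {
  for (uint64_t z = 2; z <= 10; z++) {
    uint64_t odd = z;
    uint64_t shift = 0;
    while ((odd & 1) == 0) { odd >>= 1; shift++; }
    precomputed[z] = {modular_inverse(odd), shift};
  }
}

With such a table, fastbinomial(50, 5) should return 2118760, which is indeed the number of ways to choose 5 objects out of 50.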

I provide a full portable implementation complete with some tests. Though I use C, it should work as-is in many other programming languages. It should only take tens of CPU cycles to run. It is going to be much faster than implementations relying on divisions.

Another trick that you can put to good use is that the binomial coefficient is symmetric: you can replace k by n-k and get the same value. Thus if you can handle small values of k, you can also handle values of k that are close to n. That is, the above function will also work for n smaller than 100 and k larger than 90, if you just replace k by n-k.
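
In code, this is just a small wrapper (my own sketch) around the function above:

uint64_t fastbinomial_symmetric(int n, int k) {
  if (n - k < k) k = n - k; // exploit the symmetry: C(n, k) = C(n, n - k)
  return fastbinomial(n, k);
}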

Is that the fastest approach? Not at all. Because n is smaller than 100 and k smaller than 10, we can precompute (memoize) all possible values. You only need an array of 1000 values. It should fit in 8kB without any attempt at compression. And I am sure you can make it fit in 4kB with a little bit of compression effort. Still, there are instances where relying on a precomputed table of several kilobytes and keeping them in cache is inconvenient. In such cases, the divisionless function would be a good choice.
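
For completeness, here is what that memoization might look like; the flat 100-by-10 layout below is just my reading of the bounds above, not a canonical implementation:

#include <cstdint>

// precompute all values for n < 100 and k < 10 using Pascal's rule:
// C(n, k) = C(n - 1, k - 1) + C(n - 1, k)
uint64_t table[100 * 10]; // 1000 values, 8 kB

void init_binomial_table() {
  for (int n = 0; n < 100; n++) {
    for (int k = 0; k < 10; k++) {
      if (k == 0 || k == n) table[n * 10 + k] = 1;
      else if (k > n) table[n * 10 + k] = 0;
      else table[n * 10 + k] = table[(n - 1) * 10 + (k - 1)] + table[(n - 1) * 10 + k];
    }
  }
}

uint64_t memobinomial(int n, int k) { return table[n * 10 + k]; }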

Alternatively, if you are happy with approximations, you will find floating-point implementations.

Science and Technology links (February 22nd 2020)

    1. In a large cohort study, the highest probability of reaching 90 years old was found for those drinking between 5 g and 15 g of alcohol per day. This does not mean that if you are not drinking, you should start.
    2. The Earth is getting greener thanks to CO2. In turn, a greener Earth will mitigate global warming. (Source: Nature)
    3. In 2019, the carbon emissions in the US fell by 2.9%. They fell by 5% in the European Union. They also fell in Japan. (Source: Bailey)
    4. Robots can take blood samples and apparently do a competent job, according to a clinical trial.
    5. We may soon get 80-terabyte disk drives.
    6. The age-adjusted cancer rate in the US is currently about at the same level as it was in 1930. We are not winning the war against cancer.
    7. You are better off cutting your food on wooden planks: they are more hygienic than plastic planks.
    8. Science is undergoing what some people call the “reproducibility crisis”: many important results reported in prestigious venues cannot be reproduced by other scientists, independently. Miyakawa suggests that the reproducibility crisis might be related to the fact that studies are frequently fabricated:

      (…) more than 97% of the 41 manuscripts did not present the raw data supporting their results when requested by an editor, suggesting a possibility that the raw data did not exist from the beginning.

      A few years ago, I was on the PhD committee of a student. I questioned the results. Ultimately, we asked for the software that produced that data. The student quickly reported that the software had been lost, deleted by the University. We declined to grant the PhD despite an extensive publication record (with articles in some of the best venues). I paid a political price for my choice to fail the student. The student eventually did get a PhD after an appeal. I would not be surprised to learn that this student became a professor. The lesson is that you should always doubt scientific studies. Ask that they be independently reproduced.

My thoughts on how research funding is attributed in Canada to Computer Science

In Canada, most computer science professors seek funding with NSERC, the main Canadian funding agency for science and engineering. It is more or less the equivalent of the American NSF. The core NSERC program is called “discovery” and it funds 5-year research programs. So, roughly speaking, currently funded professors apply for funding about once every 5 years. Once funded, you do not have to pursue the program you proposed: we recognize that you cannot be expected to stupidly stay on the same course for five years in a fast moving field.

Applicants get a rating regarding the excellence of the researcher, the merit of their proposal, and how well they are training students. It is an all-around evaluation. It is quite stressful.

Not all professors are funded or seek funding. However, it is common for computer science departments to expect their new faculty to seek funding. At many places, getting funding is a necessary condition to get tenure. I would expect it to be the case at all top departments. In effect, the NSERC discovery program acts as a Canada-wide peer review process.

The grants are modest: from about 20k$ a year to 100k$. Very few people get 100k$ a year, you basically have to be a Turing award recipient. So it is not a lot of money compared to what American professors get. In most cases, all the money goes to students. Professors cannot pay themselves. So getting a grant does not increase your salary.

In computer science, applications go to a committee made of 40 to 50 people. Most are from Canadian universities, but there are some people from industry and from other countries. Each application is reviewed by a team of five committee members, supervised by a chair (who is also a member of the committee) as well as at least one program officer. There are also external reviews which are taken seriously, but are just one element among many. The applicants must provide samples of their research; the committee members browse and discuss these papers. And there is a 5-page proposal describing the science that the applicant wants to pursue.

I just finished a term as co-president of the committee. It is a lot of work. I could probably have written five more papers these last three years without this service responsibility. Let me add that it is unpaid.

Here are my take-aways from the experience:

  1. We often hear that science is all about publishing lots and lots of papers. That is definitively not true. Once you put a bunch of professional computer scientists in a room and you tell them to judge someone… they quickly dismiss sheer volume. They seek quality. They seek actual science. They also tend to go with proxies, as in “they published in the best venue”. Yet, even there, it is not so much the volume that matters as the fact that specific journals and conferences are especially selective. Committee members are eager for explanations as to why the research is great; it is often the applicants themselves who are not forthcoming about details. If you wrote about a breakthrough in an online note or presented it at a local workshop, your peers will be happy to consider it, though you have a bit more explaining to do than if it appeared in a prestigious venue. And it is definitively possible to get the highest ratings without a pursuit of prestigious venues.
  2. We take services to the community and to society very seriously. It is not all about papers.
  3. I don’t think bibliometrics like citations get discussed much at all: again, it is not about numbers. Being highly cited can be good, but it is not the end game.
  4. It is surprising how even the most experienced researchers sometimes cannot competently describe a research proposal. Sometimes there are goals, but no means to achieve them. Sometimes the objectives are all over the map and incoherent. People will mix and match ideas in the most absurd manner.
  5. The committee is quite open to new and bold ideas. In fact, it rewards bold ideas if they are presented in a competent manner.
  6. Year after year, I saw “old ideas” being praised when put into a solid program. Not everything good has to be about bitcoins and deep learning. That is, it is not required that you work on fashionable ideas. The committee has a lot of tolerance for unfashionable ideas.
  7. People who try to “pad” their resume to look impressive take risks. Committee members get annoyed very quickly at people gaming the system.
  8. Everyone gets assessed on the same grid. It does not matter whether you are at the best or worst university. It does not matter if you are 20 years old or 70 years old. Evidently, it does not matter whether you are a man or a woman. So it is a system that is hard on younger, less experienced professors. It is also hard for people from small institutions. If you are both inexperienced and from a small institution, it is especially difficult. The counterpart is that if you do well while being at a small institution, you should be especially proud of yourself.
  9. Unfortunately, as with every year, many professors will get bad news. They will have failed to get funding. This may mean that they will soon lose their job. Some people are just bad at what they do or they have bad ideas: it is a service to tell them loud and clear. But most people who fail to receive funding are not bad. In many cases, the merit of their work was clear. This is not a perfect system, nor can it be. Some people simply have not yet had the means to reach the thresholds set by the grid. Some people do work that just does not fit well with the grid. This makes me sad and slightly depressed, but there is no easy fix. In my department, we specifically do not require that new professors get a discovery grant to receive tenure. We do our own assessment.
  10. It is definitively not all about politics. I have heard rumours about people from a given school of thought trying to sink people from another school, or people trying to be advocates for their own speciality. I am sure it happens. However, when it is observed, it does get some attention. Nothing is ever perfect, but we don’t let politics take over.

All in all, I feel better about my peers and about computer science after this experience at NSERC. I am generally very concerned about quality in research. There is a lot of junk out there. Yet there are also a lot of good people trying to do good work. My expectation is that the Canadian system is probably one of the best in the world because it puts quality and good science first.

 

Further reading: Ebrahim Bagheri was a fellow committee member and he wrote about his experience on twitter.

Science and Technology links (February 8th 2020)

  1. It is often believed that radiation is bad for you. To the contrary, David et al. report that life expectancy is approximately 2.5 years longer in people living in areas with an elevated background radiation in the USA.
  2. Birth order, that is whether you are the oldest or youngest sibling, is independently associated with a number of health and performance metrics. The first born is usually more fit and stable. Lin et al. argue that the main explanation might be that younger siblings are more likely to have been unwanted.
  3. The University of California has completely cancelled its subscription to research papers by Elsevier. Elsevier is one of the most important scientific publishers. It is also a for-profit publisher.
  4. Low levels of low-density lipoprotein cholesterol (LDL-C), often qualified as “bad cholesterol”, are strongly and independently associated with increased risk of cancer, cardiovascular diseases, and all-cause mortality, according to a Korean study of hundreds of thousands of human subjects. This new study puts into question mainstream beliefs about “bad cholesterol”.
  5. Education is strongly associated with better health and better longevity. However, after controlling for income and living conditions, the relationship between health and education evaporates.
  6. Harvard’s George Church has created a startup that will use an anti-aging gene therapy to increase the longevity of dogs. It is based on previous work done on mice and reported in a paper entitled A single combination gene therapy treats multiple age-related diseases (source: PNAS).
  7. As we age, 90% of us will get gray hair. It is often believed to be an irreversible process. Researchers at the University of Alabama found strong evidence that it is not the case and they believe that hair graying can be reversed. They are launching a company to develop the therapy. Note that there is documented but anecdotal evidence for gray-hair reversal, e.g., in cancer patients.

Research should not stop with the research paper

The practice of academic research is based on the production of formal documents that undergo formal review by peers. We routinely evaluate academics for jobs and promotions based on their publication output. When asked about their contribution to science, many academics are happy to point at their papers. In some cases, they will also point at the grants that they have secured.

Back when I worked at NRC, a friend of mine, Martin Brooks, gave a controversial talk entitled “user-based research”. It has been nearly 20 years, and I still remember this talk. His ideas were so upsetting that people left while he was talking. I stayed and listened. It took me years to own the ideas he expressed that day.

Martin complained that researchers mostly “threw research over the wall” expecting that other people (maybe industry) would pick up the research from the other side of the wall. While it certainly can happen that others will read your paper and apply it, you should not count on it. Nobody has to read and understand your work.

When I was a bit younger, a senior professor pointed out that some idea we were discussing had been described and tested by his team in some prestigious venue decades ago but that nobody ever did anything with it. Without thinking, I replied: it was your job to do something with it, don't blame others. The room fell silent and there was a long pause.

I am not saying that if you find a way to cure cancer in mice, it is your job to get the therapy cleared for use with human beings and to open a hospital where the therapy is delivered. A single individual can only do so much.

What I am saying, however, is that publishing a research paper is not the goal of research. It is not the final output. It is only one element in a chain. And it is not even a requirement. You need to communicate your idea, but the peer-reviewed journal article is just one such means.

Your actual goal is “transfer”. That is, someone, somewhere, must put your ideas in practice beyond the publication of your paper. It does not mean “industrial applications” though it can be that. If your idea is worthwhile and you let it end with a research paper, you have failed.

And it does not merely mean collecting “citations” or other bibliometrics. People routinely cite papers without reading them. Few citations are influential.

But the academic incentives almost conspire to prevent impactful research. There is one specific criterion that academics like to apply that is destructive: novelty. For some piece of academic work to be judged worthwhile, it must be novel. I will say it again and again: originality is overrated.

Of course, people entering a new field tend to “rediscover known facts”. You have to push them back and tell them to go study some more. But there is a difference between naïveté and lack of originality. You have to be aware of the history of your field, that is what scholarship is all about. But you also have to stick with ideas for the long run, until the fruits appear.

Instead of rewarding novelty, we should reward scholarship: we should ask people to show that they have studied and documented the past. We should never penalize someone who works on pushing a known idea by refining it, communicating it, validating it.

This idea that some venerable professor had 20 years ago that never went anywhere? Well, it might be entirely worthwhile to revisit it and make it go somewhere, even if it is not novel at all.

Further reading: Boote, D. N., & Beile, P. (2005). Scholars before researchers: On the centrality of the dissertation literature review in research preparation. Educational researcher, 34(6), 3-15.

Science and Technology links (February 1st 2020)

  1. Almost all climate predictions are based on the so-called “business as usual” model, yet this model is based on assumptions that are unrealistically pessimistic. (Source: Nature)
  2. African populations have Neanderthal ancestry. This might be indicative that people from Europe went back to Africa over the centuries.
  3. The oceans are not acidic; their pH is about 8.0. (Source: Forbes)
  4. As we age, we tend to accumulate “senescent cells”: these are cells that should be dead but somehow linger around. We know how to kill them selectively with some success. Recent research suggests that brain neurodegeneration might be caused by senescent brain cells. Neurons themselves do not become senescent, as far as I know, but supporting cells like astrocytes do. This suggests that we may improve brain health in older people with therapies to remove senescent cells. Similarly, senescent cells might cause osteoporosis, the phenomenon whereby the bones of older people become brittle.
  5. Mitochondria are small organelles that live in our cells and produce our energy. We have now learned that mitochondria also live inside the blood stream. It is unclear what purpose they serve, if any.
  6. How frequently a couple kisses is a strong predictor of relationship satisfaction.
  7. Researchers have transplanted stem cells as “cell sheets” on the heart of human patients. The hope is that these sheets contribute to repair the heart of the patients.
  8. Most of us at some point suffer from atherosclerosis, the accumulation of plaques in our arteries. Researchers developed a nanotherapy that helps reduce atherosclerosis in mice.

Cost of a thread in C++ under Linux

Almost all our computers are made of several processing cores. Thus it can be efficient to “parallelize” expensive processing in a multicore manner. That is, instead of using a single core to do all of the work, you divide the work among multiple cores. A standard way to approach this problem is to create threads.

A C++ thread object executes some functions, possibly on a thread created by the operating system, and goes away. If you wanted to increment a counter using a C++ thread, you could do it in this manner:

auto mythread = std::thread([] { counter++; });
mythread.join();

It is obviously silly code that you should never use as is; it is a mere illustration. Creating a new thread is not free. Exactly how expensive it might be depends on several parameters. But can we get some rough idea?

For this purpose, I wrote a small benchmark where I just create a thread, increment a counter and let the thread die. I measure the time elapsed while waiting for the thread to run its course. My program computes the mean and standard error of the time, as well as the minimum and maximum duration of the test. For simplicity, I am just going to report the means:

system                         time per thread
Ampere server (Linux, ARM)     200,000 ns
Skylake server (Linux, Intel)  9,000 ns
Rome server (Linux, AMD)       20,000 ns
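
For reference, a minimal sketch of this kind of benchmark (not the exact program behind the table above) might look like this:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

int main() {
  volatile uint64_t counter = 0;
  const int trials = 1000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < trials; i++) {
    // create a thread, let it increment the counter, wait for it to finish
    std::thread mythread([&counter] { counter = counter + 1; });
    mythread.join();
  }
  auto stop = std::chrono::steady_clock::now();
  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
  printf("%.0f ns per thread\n", double(ns) / trials);
}

A real benchmark would also record the standard error, minimum and maximum, as described above.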

I am deliberately not going into the details of the compiler, system library, operating system, RAM and all that fun stuff. You should not look at my table and make far reaching conclusions.

What is clear, however, is that creating a thread may cost thousands of CPU cycles. If you have a cheap function that requires only hundreds of cycles, it is almost surely wasteful to create a thread to execute it. The overhead alone is going to set you back.

There are clever ways to amortize the cost of a thread. For example, you may avoid the constant creation of new threads as in my benchmark. Yet to amortize, you still need to have enough work to make it worthwhile.
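
As a trivial illustration of amortization (my own sketch), a single thread handling a whole batch of work items pays the creation cost only once:

#include <cstdint>
#include <thread>

int main() {
  volatile uint64_t counter = 0;
  // one thread, one thousand work items: the thread-creation overhead is
  // paid once instead of a thousand times
  std::thread worker([&counter] {
    for (int i = 0; i < 1000; i++) counter = counter + 1;
  });
  worker.join();
}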

Science and Technology links (January 25th 2020)

  1. Scientists found a way to increase the production of new neurons in the brains of mice, effectively rejuvenating the brains of old mice. (Source: Nature)
  2. How many people died during accidents at nuclear power plants? In North America, the most famous accident happened at Three Mile Island when a reactor overheated, shut down, and a small amount of harmless radioactive gas was released. Nobody was harmed.
    In the recent Fukushima accident in Japan, about 50 people died directly as a result of the event. Many more people died following the event, but they did not die due to radiation or exposure to radioactive material. Instead, they mostly died due to the evacuation process, and most of the deaths were elderly or sick people for whom the evacuation was traumatic. It is estimated that infants in the region will see their cancer risk raised by 1% over their lifetime. In comparison, the nuclear accident in Chernobyl was more tragic: thousands of people probably died as a result. Should you conclude that nuclear energy is risky? In the seventies, a dam collapsed and killed an estimated 230,000 people. Thus hydro-electric power is far more deadly than nuclear power, when just accounting for this one incident. And, of course, coal power plants are far more deadly than either hydro or nuclear. So we are irrational with respect to our fear of nuclear power. Speaking for myself, I would gladly live and work in a nuclear power plant, though maybe not one designed by the USSR. Furthermore, we now have the technology to build small nuclear reactors that would be orders of magnitude safer, if only we allow it.
  3. Academic researchers continue to fail to report the results of clinical trials in the USA, breaking the law in the process. This can be verified since clinical trials are public events, posted on the Web with the name of the researchers. It would be trivial to identify by name the academic researchers who fail to report the results of their clinical trial. These same clinical trials are typically funded by the government. The lack of compliance has been reported for years.
  4. Japan has 70,000 centenarians.
  5. Topical application of a drug commonly used as an immunodepressor (Rapamycin) appears to rejuvenate the skin of older people.
  6. Researchers were able to regenerate the optic nerve of mice after crushing it. It is believed to be an indication that we could reverse age-related vision loss due to glaucoma or other diseases.
  7. Exercise keeps you healthier, but how the benefits work is a bit unclear. Researchers found that by removing genetically some family of proteins (Sestrins) from flies and mice, they were able to negate the benefits of exercise. In contrast, increasing the production of Sestrins appear to emulate the benefits of exercise. (Source: Nature)
  8. Researchers claim to have found a new form of immunotherapy that would be effective against all cancers. However, it is at an early stage and we are not even considering clinical trials at this time.

Filling large arrays with zeroes quickly in C++

Travis Downs reports that some C++ compilers have trouble filling up arrays with values at high speed. Typically, to fill an array with some value, C++ programmers invoke the std::fill function. We might assume that for a task as simple as filling an array with zeroes, the C++ standard library would provide the absolute best performance. However, with GNU GCC compilers, that is not the case.

The following line of C++ code gets compiled as a loop that fills each individual byte with zeroes when applying the conventional -O2 optimization flag.

std::fill(p, p + n, 0);

When the array is large, this can be inefficient compared to a highly optimized implementation such as the C function memset.

memset(p, 0, n);

Travis reports that the difference is a factor of thirty. He also explains how to fix the C++ so that it is as fast as the memset function.
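
As I understand the fix, the culprit is that the literal 0 is an int: with a value of the array's own byte type, GNU GCC's standard library can dispatch std::fill to memset. A sketch:

#include <algorithm>
#include <cstddef>
#include <cstring>

// the int literal 0 prevents the byte-wise memset specialization
void fill_slow(char *p, std::size_t n) { std::fill(p, p + n, 0); }

// a char-typed zero lets the standard library call memset internally
void fill_fast(char *p, std::size_t n) { std::fill(p, p + n, char(0)); }

// the C baseline for comparison
void fill_memset(char *p, std::size_t n) { std::memset(p, 0, n); }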

What about very large arrays? What if you need to write zeroes all over a 2GB array?

memset 30 GB/s
std::fill 1.7 GB/s

Roughly speaking, the memset function is 15 times faster than std::fill in my test. This demonstrates that even if you are working with large volumes of data, you can be easily limited by your inefficient implementations. You are not automatically limited by your memory bandwidth. In fact, std::fill in this instance cannot even keep up with a good network adaptor or a fast disk. We routinely parse JSON data, and write out DOM trees, at speeds far higher than 1.7 GB/s.

Another lesson I would derive from this example is that writing your code in idiomatic C++ is not sufficient to guarantee good performance. There is a lot of engineering involved in making C++ fast, and though it often works out magically, it can fail too.

My source code is available.

Allocating large blocks of memory: bare-metal C++ speeds

In a previous post, I benchmarked the allocation of large blocks of memory using idiomatic C++. I got a depressing result: the speed could be lower than 2 GB/s. For comparison, the disk in my laptop has greater bandwidth.

Methodologically, I benchmarked the “new” operator in C++ with initialization, using the GNU GCC compiler with the -O2 optimization flag [1].

char *buf = new char[s]();

It turns out that you can do better while sticking with C++. We cannot simply call the new operator without initialization because, in general, it does not result in the memory being actually allocated. However, we can allocate the memory and then make sure that we touch every “page” of memory. On modern Intel systems, pages are effectively always at least as large as 4kB, so we can touch the memory every 4kB. We might call this approach “new and touch”.

char * buf = new char[size];
for (size_t i = 0; i < size; i += 4096) buf[i] = 0;
buf[size - 1] = 0;

Such a new-and-touch strategy should be close to “bare-metal” memory allocation speeds. So how fast is it? It depends on the page size. By default, most systems rely on small (4kB) pages. Allocating many small pages is expensive. Thankfully, Linux can be configured to use huge pages, transparently, via a feature called “transparent huge pages”. And it makes a huge difference!

                         Allocating 512MB   Setting 512MB to zero
regular pages (4kB)      3.9 GB/s           30 GB/s
transparent huge pages   20 GB/s            30 GB/s

I am using a recent Linux system (Ubuntu 16.04), a Skylake processor and GNU GCC 8.3 with the -O2 optimization flag. My source code is available.

It is still the case that allocating memory on most systems is a non-trivial cost since they rely on small 4kB pages. There are fast disks available on the market that have more than 4 GB/s of bandwidth.

Credit: Thanks to Travis Downs and others for their insights and comments.

 

[1] Downs found that we get far better performance out of the new operator with initialization under GNU GCC with the more aggressive -O3 optimization flag. Performance-wise, it should be close to the “new and touch” approach that I am describing.

How fast can you allocate a large block of memory in C++?

In C++, the most basic memory allocation code is just a call to the new operator:

char *buf = new char[s];

According to a textbook interpretation, we just allocated s bytes [1].

If you benchmark this line of code, you might find that it is almost entirely free on a per-byte basis for large values of s. But that is because we are cheating: the call to the new operator “virtually” allocates the memory, but you may not yet have actual memory that you can use. As you access the memory buffer, the system may then decide to allocate the memory pages (often in blocks of 4096 bytes). Thus the cost of memory allocation can be hidden. The great thing with a virtual allocation is that if you never access the memory, you may never pay a price for it.

If you actually want to measure the memory allocation in C++, then you need to ask the system to give you s bytes of allocated and initialized memory. You can achieve the desired result in C++ by adding parentheses after the call to the new operator:

char *buf = new char[s]();

Then the operating system actually needs to allocate and initialize memory [2]. It may still cheat in the sense that it may recycle existing blocks of memory or otherwise delay allocation. And I expect that it might do so routinely if the value of s is small. But it gets harder for the system to cheat as s grows larger.
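
The essence of the measurement is a timing loop along these lines (my own minimal sketch, not the exact benchmark):

#include <chrono>
#include <cstddef>
#include <cstdio>

int main() {
  const std::size_t s = 512 * 1024 * 1024; // 512MB, as in the table below
  auto start = std::chrono::steady_clock::now();
  char *buf = new char[s](); // allocated and zero-initialized
  auto stop = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(stop - start).count();
  // print a byte of the buffer so that the allocation cannot be optimized away
  printf("%.1f GB/s (first byte: %d)\n", (s / 1e9) / seconds, buf[0]);
  delete[] buf;
}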

What happens if you allocate hundreds of megabytes in such a manner? The answer depends on the size of the pages. By default, your system probably uses small (4kB) pages. Under Linux, you can enable “transparent huge pages” which dynamically switches to large pages when large blocks of memory are needed. Using larger pages means having to allocate and access fewer pages, so it tends to be cheaper.

In both instances, I get around a couple of gigabytes per second on a recent Linux system (Ubuntu 16.04) running a conventional Skylake processor. For comparison, you can set memory to zero at tens of gigabytes per second and my disk can feed data to the system at more than 2 GB/s. Thus, at least on the system I am currently using, memory allocation is not cheap. My code is available, I use GNU GCC 8.3 with the -O2 optimization flag.

                         Allocating 512MB   Setting 512MB to zero
regular pages (4kB)      1.6 GB/s           30 GB/s
transparent huge pages   2.4 GB/s           30 GB/s

You can do better with different C++ code, see my follow-up post Allocating large blocks of memory: bare-metal C++ speeds.

Further remarks. Of course, you can reuse the allocated memory for greater speeds. The memory allocator in my standard library could possibly do this already when I call the new operator followed by the delete operator in a loop. However, you still need to allocate the memory at some point, if only at the beginning of your program. If your program needs to allocate 32 GB of memory, and you can only do so at 1.4 GB/s, then your program will need to spend 23 seconds on memory allocation alone.

Footnotes:

  1. Several readers have asked why I am ignoring C functions like calloc, malloc and mmap. The reason is simple: this post is focusing on idiomatic C++.
  2. You might wonder why the memory needs to be initialized. The difficulty has to do with security. Though it is certainly possible for the operating system to give you direct access to memory previously used by another process, it cannot also pass along that process's data. Thus it needs to erase the data. To my knowledge, most systems achieve this result by zeroing the new memory.

Science and Technology links (January 11th 2020)

  1. The extra wealth that one acquires by attending college is now estimated to be indistinguishable from zero. The authors control for the parents' education and the individual's financial acumen.
  2. A drug approved for use in human beings (Rapamycin) appears to rejuvenate dental health in mice.
  3. Low-fat milk increases the obesity risk of children when compared with high-fat milk.
  4. It appears that type 2 diabetes might be caused by fat. Keeping your abdominal fat in check might be the best way to prevent diabetes (speculative?).

How I teach database design

Most software runs on top of databases. These databases are organized logically, with a schema, that is, a formal description. You have entities (your user), attributes (your user’s name) and relationships between them.

Typical textbook database design comes from an era when it was possible to believe that you knew, at the start of the project, what the data would be like. And you could believe that it would not change. You could certainly believe that your 200-page design document would be updated and maintained, and it would be compulsory reading for computer folks in your company for decades. When building an accounting application, you would query the databases using SQL. All of the data was accessed through the databases. The database was the interface. And the same set of centralized databases running on a single computer would serve the entire operation. And everyone would know what the field “name” meant, exactly, in the table “user”. There would not be mergers between companies, or corporate pivots taking you into a new direction. You could believe that your conceptual and logical principles were independent from physical implementation principles. There would be no need to ever duplicate data for performance.

Quite often, it is not how systems are built today, if they ever were built this way. You organize functions in your organization in terms of services. And the services communicate between themselves. People who do not have to work directly on your service do not need to know how the data is organized or even if the organization is publicly documented at all. You have to provide them an interface, and that is all that they should need. On your end, you cannot rely on the database engine alone to ensure that the data remains usable. “But my data follows a strict schema” is not an excuse for failure. Your users do not want to know about your data layout, they do not want to have to know.

When organizing your data, you will probably get it wrong. If not today, then next week or next year. Life is fast changing and your knowledge is always incomplete. The waterfall model is naive to the point of being actively harmful. So you want to isolate the data layout from its use as much as you can.

In textbooks, the core principle of database design is often “do not duplicate data”. The logic is sound: if you duplicate data, at some point these values may go out of sync and what do you do then? If I see a programmer duplicating data all over his code, I know he lacks experience. However, it is also the case that deliberate duplication is essential. Sometimes you need duplication for performance and performance matters. Sometimes you need duplication to avoid unwieldy complexity, and managing complexity matters. And you need duplication because your information system runs on more than one processor.

You want to store everything in one giant system so that there is no duplication and everything is consistent? This thinking does not scale. It does not scale with respect to complexity or performance. And it does not even achieve consistency in the real world.

The data you will be receiving and maintaining is not pristine. In fact, it is full of errors and unknowns. Marketing thinks that John Smith’s phone number is one thing whereas accounting thinks that it is something else? You know what? Maybe they are both right or maybe they are both wrong. And it is worse than you think because there are unknown unknowns: nothing in your database can be regarded without suspicion. And the minute you have incomplete or uncertain data, your dream of enforcing the relationships between entities may just become naive.

Does that mean that anything goes and that you don’t need to worry about database design? Quite the opposite: you ought to be agonizing over every decision and making sure that you never lock yourself into a corner. Because you will have duplication and multiple systems, you will have inconsistencies and you will need to deal with them, and a single strategy won’t work for all cases.

So how do I teach database design? I ask students to be critical about the textbook material, and I expose them as much as possible to real-world examples where things are not perfectly neat. Begin by looking under the hood of your favorite open-source project.

Further reading: On Desirable Semantics of Functional Dependencies over Databases with Incomplete Information, Fundamenta Informaticae 158 (2018); Functional dependencies with null markers, Computer Journal 58 (2015); A Call to Arms: Revisiting Database Design, SIGMOD Record 40 (2011).

My Science and Technology review for 2019

I like to end every year with my selection of the most significant science and technology events.

  1. In 2019, you could buy a computer from Apple with 1.5 terabytes of memory. And by memory, I mean fast internal memory (RAM). Of course, it would cost tens of thousands of dollars. You and I are unlikely to have that much memory on our computers in the near future. Yet we are all benefiting from faster disks, making the distinction between disk and memory harder. (Note that it has been possible to buy computers with a terabyte of memory and more from other vendors, but not typically in a mainstream desktop form.)
  2. A team led by Conboy showed that we could rejuvenate multiple organs in old mice merely by giving them Alk5i and oxytocin. Quoting from the paper:

    Alk5i plus OT quickly and robustly enhanced neurogenesis, reduced neuro-inflammation, improved cognitive performance, and rejuvenated livers and muscle in old mice.

    You can purchase oxytocin on the Internet and many people take it already to improve their sleep.

  3. We can tell how old someone is by looking at “epigenetic markers”. That is, from one of your cells, we can look at how your DNA is arranged and expressed, and it tells us your age. In some sense, it defines your “biological age” as opposed to your “chronological age”. As far as we could tell, there was no way to intervene and reverse this age. However, in a study reported in Nature, researchers found such age reversal in a small clinical trial. This age reversal came about accidentally when trying to regrow the “thymus” in older people. The thymus naturally disappears with age, leaving older people with weakened immune systems. The small clinical trial also showed that the thymus was regrown, something that is unexpected.
  4. Apple released a smart watch with an integrated ECG, to monitor your heart conditions.
  5. As we age, we accumulate “senescent cells”. These are old cells that create trouble and refuse to die. We now know how to wipe them out in mice, using Dasatinib and Quercetin. It turns out that the exact same approach works in human beings. At least in mice, clearing senescent cells leads to all sorts of health improvements.
  6. A drone attack wiped out 50% of Saudi Arabia’s oil supply for several days. This is a huge impact given that Saudi Arabia provides about 10% of the oil supply worldwide. It is a testimony of how good drone technology has become.
  7. Surgeons are putting patients in suspended animation (they stopped their biological functions) during surgery, replacing their blood with a cold saline solution.

Science and Technology links (December 21st 2019)

  1. The number of research papers with more than 1000 authors is increasing quickly and reaching many fields.
  2. Researchers at Facebook use neural networks to solve fancy algebraic problems that may be out of reach of computer algebra systems.
  3. At least on the short term, wind turbines may contribute more to global warming than equivalent fossil fuel technologies. The warming effect has been empirically measured.
  4. The Bengal tiger population is increasing quickly in India.

Xor Filters: Faster and Smaller Than Bloom Filters

In software, you frequently need to check whether some object is in a set. For example, you might have a list of forbidden Web addresses. As someone enters a new Web address, you may want to check whether it is part of your black list. Or maybe you have a large list of already used passwords and you want to check whether the proposed new password is part of this list of compromised passwords.

The standard way to solve this problem is to create some key-value data structure. If you have enough memory, you might create a hash table. Or you might use a good old database.

However, such approaches might use too much memory or be too slow. So you want to use a small data structure that can quickly filter the requests. For example, what if you had a tiny data structure that could reliably tell you that your password is not in the list of compromised passwords?

One way to implement such a filter would be to compute a hash value of all your objects. A hash value is a random-looking number that you generate from your object, in such a way that the same object always generates the same random-looking number, but such that other objects are likely to generate other numbers. So the password “Toronto” might map to the hash value 32. You can then store these small numbers into a key-value store, like a hash table. Hence, you do not need to store all of the objects. For example, you might store 32-bit numbers for each possible password. If you give me a potential password, I check whether its corresponding 32-bit value is in the list and if it is not, then I tell you that it is safe. So if you give me “Toronto”, I check whether 32 is in my table. Otherwise, I send your request to a larger database, for a full lookup. The probability that I send you in vain for a full lookup is called the “false positive probability”.

Though this hash table approach might be practical, you could end up using 4 bytes per value. Maybe this is too much? Bloom filters come to the rescue. A Bloom filter works similarly, except that you compute several hash values from your object. You use these hash values as indexes inside an array of bits. When adding an object to the set, you set all bits corresponding to the object to one. When you receive a new object and you want to check whether it belongs to the set, you can just check whether all of the corresponding bits have been set. In practice, you will use far fewer than 4 bytes per value (say 12 bits) and still be able to achieve a false positive rate of less than 1%.
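
To make this concrete, here is a minimal Bloom-filter sketch; the bit-array size, the number of hash functions and the hash mixing below are arbitrary choices of mine, purely for illustration:

#include <cstddef>
#include <cstdint>
#include <vector>

struct BloomFilter {
  std::vector<bool> bits;
  std::size_t num_hashes;

  BloomFilter(std::size_t num_bits, std::size_t k)
      : bits(num_bits), num_hashes(k) {}

  // derive the i-th "hash function" by mixing one 64-bit hash with a seed
  static uint64_t mix(uint64_t h, uint64_t seed) {
    h += seed * 0x9E3779B97F4A7C15ULL;
    h ^= h >> 33;
    h *= 0xFF51AFD7ED558CCDULL;
    h ^= h >> 33;
    return h;
  }

  void add(uint64_t hash) {
    for (std::size_t i = 0; i < num_hashes; i++)
      bits[mix(hash, i) % bits.size()] = true;
  }

  // false means "definitely not in the set", true means "possibly in the set"
  bool possibly_contains(uint64_t hash) const {
    for (std::size_t i = 0; i < num_hashes; i++)
      if (!bits[mix(hash, i) % bits.size()]) return false;
    return true;
  }
};

With roughly 12 bits per entry and 8 hash functions, such a filter brings the false positive rate below 1%, consistent with the numbers above.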

While the Bloom filter is a textbook algorithm, it has some significant downsides. A major one is that it needs many data accesses and many hash values to check that an object is part of the set. In short, it is not optimally fast.

Can you do better? Yes. Among other alternatives, Fan et al. introduced Cuckoo filters which use less space and are faster than Bloom filters. While implementing a Bloom filter is a relatively simple exercise, Cuckoo filters require a bit more engineering.

Could we do even better while limiting the code to something you can hold in your head?

It turns out that you can with xor filters. We just published a paper called Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters that will appear in the Journal of Experimental Algorithmics.

The following figure gives the number of bits per entry versus the false positive probability. Xor filters offer better accuracy for a given memory budget. With only 9 bits per entry, you can get a false positive probability much less than 1%.

The complete implementation in Go fits in less than 300 lines and I did not even try to be short. In fact, any semi-competent Go coder can make it fit within 200 lines.

We also have an optimized C version that can be easily integrated into your projects since it is a single header. It is larger than 300 lines, but contains different alternatives including an approach with slightly faster construction. I wrote a small demonstration in C with a compromised password detection problem. The xor filter takes a bit longer to build, but once built, it uses less memory and is about 25% faster.

We also have Java and C++ implementations. We have Rust, Python and Erlang versions.

An xor filter is meant to be immutable. You build it once, and simply rebuild it when needed. Though you can update a Bloom filter by adding keys to it, it means either overallocating the initial filter or sacrificing accuracy. These filters are typically not meant to be used as dynamic data structures (unlike a hash table) since they have a fixed capacity. In the case of cuckoo filters, once you approach the maximum capacity (say within 94%), then the insertion of new values may fail, and the solution when it does is to rebuild the whole thing.

Deletions are generally unsafe with these filters even in principle because they track hash values and cannot deal with collisions without access to the object data: if you have two objects mapping to the same hash value, and you have a filter on hash values, it is going to be difficult to delete one without the other. No matter what, you must somehow keep track of what was added in the filter and what was removed, and the filter itself cannot tell you. The same issue is at play when you are building the filter: the filter itself cannot tell you whether you already added an element: you have to keep track of this information yourself.

Furthermore, all these filters are frequently used in a concurrent setting, with multiple threads. You typically cannot safely modify the filter while querying it. The simplest solution to avoid expensive locks is to make the filter immutable. Keep in mind that these filters are meant to be small.

The construction of the xor filter requires several bytes of temporary memory per entry, but this memory can be released immediately after construction and a compact data structure remains, using just a few bits per entry.

If you do not have stringent requirements regarding memory usage, other techniques might be preferable like blocked Bloom filters which have unbeatable speed (at the expense of higher memory usage). The xor filters are really at their best when you want high speed and tight memory usage.

Credit: This is joint work with Thomas Mueller Graf. Xor filters are basically an efficient implementation and specialization of a theoretical construction, the Bloomier filter.

A look back to my 2010 predictions for 2020

Back in 2010, I wrote a post Who is going to need a database engine in 2020?

Let me revisit some of my 2010 statements.

Apple will sell desktops with 1 TB of RAM in 2020.

I am sure that the prediction sounded insane back in 2010, but it actually happened. A Mac Pro can have up to 1.5TB of RAM right now.

Clever programmers can write much faster specialized engines. Obviously, programmers will need help. They will need great libraries to help with data processing, data compression, and data architecture.

This happened too. The number of custom data processing engines has exploded: Elasticsearch (2010), Apache Spark (2014), ClickHouse (2016), and so forth. The great libraries for data processing have also materialized. In this instance, I put my fingers where my mouth was and invested my own time in this trend, to good effect I would say.

I won’t make predictions except to say that I expect new and more powerful programming languages to replace the existing ones. I’ll be pretty sad if in 2020, I’m still primarily using Python, Java and C++. There is so much innovation out there that something strong has to emerge out of it.

New contenders have emerged: Swift (2014), Go (late 2009), Rust (2010), Julia (2012). I have done some work in all of them and they are a step forward.

I am still mostly programming in Python, Java, C and C++.

I would say that C++ has become a much better language since 2010. It is not any easier, but it has exceeded my expectations.

If anything, Python’s popularity is far higher than it was back in 2010.

So I am mildly sad. I am still doing a lot of C++ and it is really, really hard. However, I could productively switch to shinier programming languages.

Overall I would say that my 2010 beliefs about the future were accurate, at least as far as this one blog post is concerned.

Science and Technology links (December 14th 2019)

  1. The computation capacity needed by artificial intelligence doubles every 3.4 months. In parallel, we are making fast progress in hardware and software: what took three hours a year and a half ago can now take less than two minutes.
  2. In Luxembourg, a small European country, all public transportation is free. Luxembourg has also a gay prime minister who is married.
  3. Men and women have very different personalities. Knowing only the personality traits of someone, you can tell whether they are a man or a woman with 85% accuracy.
  4. We are deploying drones equipped with machine guns.
  5. Alzheimer’s is a terrible disease and there is currently no proven way to even slow down the disease. We might be very close to approving the first drug which might favorably alter the progression of the disease.
  6. We are approving new medical therapies at record rates.
  7. Many worry that computers and artificial intelligence will destroy so many jobs so quickly that human workers will never recover. Scholl and Hanson find no evidence for the emergence of such a trend.
  8. Crows can use tools and solve multi-step puzzles.
  9. As we age, we accumulate defective non-dividing cells called senescent cells. The older we get the more quickly we accumulate them. The progression is exponential.

Credit: Thanks to Peter Turney for sending me a link to the 2019 AI Index report.

Are 64-bit random identifiers free from collision?

It is common in software systems to map objects to unique identifiers. For example, you might map all web pages on the Internet to a unique identifier.

Often, these identifiers are integers. For example, many people like to use 64-bit integers. If you assign two 64-bit integers at random to distinct objects, the probability of a collision is very, very small. You can be confident that they will not collide.

However, what about the case where you have 300 million objects? Or maybe 7 billion objects? What is the probability that at least two of them collide?

This is just the birthday paradox. Wikipedia gives us an approximation to the collision probability, assuming that the number of objects r is much smaller than the number of possible values N: 1 - exp(-r^2/(2N)). Because there are so many 64-bit integers, it should be a good approximation.

Number of objects   Collision probability
500M                0.7%
1B                  3%
1.5B                6%
2B                  10%
4B                  35%
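
These numbers follow directly from the approximation; a small program along these lines reproduces them:

#include <cmath>
#include <cstdio>

int main() {
  const double N = 18446744073709551616.0; // 2^64 possible identifiers
  const double counts[] = {500e6, 1e9, 1.5e9, 2e9, 4e9};
  for (double r : counts) {
    double p = 1.0 - std::exp(-r * r / (2.0 * N));
    printf("%10.0f objects -> collision probability %.1f%%\n", r, 100.0 * p);
  }
}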

Thus if you have a large system with many objects, it is quite conceivable that your randomly assigned 64-bit identifiers might collide. If a collision is a critical flaw, you probably should not use only 64 bits.

Amazon’s new ARM servers: Graviton 2

Most servers on the Internet run on x64 processors, mostly made by Intel. Meanwhile, most smartphones run ARM processors.

From a business perspective, these are different technologies. The x64 processors are mostly designed by only two companies (Intel and AMD), with one very large dominant player (Intel). In contrast, ARM processors come in many different flavours. Apple, Qualcomm, Samsung and others design their own ARM processors. This diversity can be either a blessing or a curse. The blessing is that you get a lot of direct competition and many custom solutions. The curse is that it is harder to support ARM systems because there are so many variations.

Amazon is the largest public cloud provider and they are large enough to design their own servers and even their own silicon. Some time ago, they launched a service (Graviton) based on their own ARM-based servers. I tested them out, but the performance just was not there. Amazon just announced a second iteration of these servers called Graviton 2 and they claim a 7-fold performance increase over their previous ARM servers. They are based on processor designs made for servers called Neoverse. I do not yet have access to these servers, but Matthew Wilson from Amazon was kind enough to run my standard JSON parsing benchmark (using simdjson and the twitter.json data file).

Compiling the results in a table suggests that these new Amazon servers have processors that are nearly as good as those in a flagship smartphone.

system                      processor     frequency   JSON parsing speed
iPhone XR                   A12           2.5 GHz     1.3 GB/s
Graviton 2                  Neoverse N1   2.5 GHz     1.1 GB/s
Ampere (first generation)   Skylark       3.2 GHz     0.37 GB/s
Rockpro64                   Cortex-A72    1.8 GHz     0.32 GB/s

My fast JSON parsing benchmark is just something I happen to care about. It is probably not representative of whatever you care about. In particular, it is CPU intensive whereas servers have many other bottlenecks.

Nevertheless, I find these results quite encouraging. If I normalize the speed by the frequency, I get that the new Neoverse N1 processor is 2.5 times faster than the Cortex-A72. When they come out, they may be the fastest publicly accessible ARM servers.

Amazon is claiming that these Graviton 2 servers offer much better performance than Intel-based servers (EC2 M5) and that they will be slightly cheaper. My expectation is that the better performance will be due in large part to a higher number of cores.

Update: Matthew Wilson reports 1.3 GB/s following some optimization work.