How fast can scancount be?

In an earlier post, I described the following problem. Suppose that you have tens of arrays of integers. You wish to find all integer that are in more than 3 (say) of the sets. This is neither a union nor an intersection. A simple algorithm to solve this problem is to use a counter for each possible value, and increment the counter each time the value is seen. We can finally check the counters and output the answer.

counter <- array of zeros
for array x in set of arrays {
    for value y in array x {
      counter[y] += 1
for i = 0; i < counter.size(); i++ {
  if(counter[i] > threshold)
     output i;

You can do much better if you break down the space of the problem, so that all of your counters fit in the CPU cache. You solve the problem using multiple passes. It also helps if you use small counters, say 8-bit counters.

With this early optimization, I was happy to report a performance of about 16 cycles per input value. It is pretty good… but Nathan Kurz and Travis Downs pointed out that you can do much better.

What are some of the issues?

  1. I was using textbook C++ with iterators and STL algorithms. Converting the problem into lower-level code (using pointers or indexes) seems better for performance.
  2. Using 8-bit counters triggers all sorts of nasty “aliasing” problems: effectively, the compiler becomes paranoid and thinks that you might be writing all over the memory. Travis wrote a blog post on the topic and on how you can mitigate the issue.
  3. There are many low-level issues one might want to handle. For example, if you have 8-bit counter values and an 32-bit threshold value, an extra instruction might be needed to zero-extend your counter value.

We now have two much faster versions to offer. They run at between 3 to 4 cycles per input element. If you are keeping track, it is better than four times faster than my original “optimized” version. I make the code available on GitHub as a header-only C++ library.

    1. The first version is derived from a comment by Nathan Kurz. It works identically to my original version, but instead of using a second pass through the counters, we check the counter values as we modify them. Other than that, I just carefully rewrote my original C++ in “plain C”. I achieve 4 cycles per input element and a number of instructions per cycle of 3.1 on a skylake processor with GNU GCC and around 3.5 cycles per input element under LLVM clang. The implementation should be portable.
    2. Travis was nice enough to share his implementation which follows the same original algorithm, but has been vectorized: it uses fast AVX2 instructions. It is slightly faster under the GNU GCC 8 compiler, sometimes approaching 3 cycles per input element.


So how fast can scancount be? I have not yet answered this question.

In an earlier comment, Travis gave a bound. If each input element must result into the increment of a counter that is not a register, then we have one store per input element, and thus we need at least one cycle per input element on processors that are limited to one store per cycle.

Of course, we also need to load data: say one load to get the element, and one load to get the corresponding counter. Assuming that our hardware can do either two loads per cycle or one store per cycle, then we get a limit of at least two cycles per input element. Based on this napkin mathematics, if we achieve 4 cycles, then we are within a factor of two of the best possible algorithm.

Still: we could go much faster without breaking the speed of light. We are almost solving an embarrassingly parallel problem: we break the problem into subspaces (for the caching benefits) but this means that should be able to dispatch independent components of the problem to distinct cores.

Science and Technology links (August 24th 2019)

  1. The net contribution of the Amazon ecosystem to the world’s oxygen is effectively zero. Furthermore, there is a lot of oxygen in the air and it would be excessively difficult to lower or increase the ratio. In effect, we are not going to run out of oxygen on Earth any time soon, even if we tried to.
  2. According to Nature, the Earth is getting greener. The main driver is climate change and CO2 which acts as a fertilizer. A large fraction of this greening is happening in China and India. These countries have increased their food production by 35% since 2000.
  3. Vitamin D supplements appear to reduce cancer risks. Vitamin D supplements are typically available as either vitamin D2 or D3. We know that D2 supplements do not lower mortality risks. We do not have enough data to assess vitamin D3. (Zhang et al., 2019)
  4. There is no evidence that plastic particules in water are harmful to human beings.
  5. CNBC reports that Mackey, a vegan who runs the successful Whole Foods chain, is critical of the fashionable Beyond Meat “plant-based meat”:

    I don’t think eating highly processed foods is healthy. I think people thrive on eating whole foods. As for health, I will not endorse that, and that is about as big of criticism that I will do in public.

    Meanwhile the stock price of Beyond Meat went up 800% in a few months.

Science and Technology links (August 17th 2019)

  1. Google may soon release an augmented-reality version of Google Maps for mobile phones. I have always found it difficult to identify the streets I see in an unfamiliar setting, and it looks like Google may be solving this very problem. I am looking forward to seeing how well it works in practice.
  2. The delivery company UPS has been using self-driving trucks for some time. However the trucks are still manned by a driver and an engineer.
  3. Researchers have rejuvenated the brain cells of old rats. There is a press release.
  4. Blood pressure is an important health marker and it would be important to monitor it more broadly. There are cheap devices to measure blood pressure automatically, but they typically require that you put a strap around your arm. Researchers have shown that we can measure blood pressure using the camera of a smartphone. It remains to verify that it can work outside of the lab, in the real world.
  5. Are electric cars good for the environment? According to Holland et al., they are worse than gasoline cars on average, even though they may be preferable where you live:

    Ignoring local pollution leads to an overestimate of the benefits of electric vehicles and an underestimate of the geographic heterogeneity. Accounting for both global and local pollution, we find electric vehicles generate negative environmental benefits of 0.73 cents per mile on average relative to comparable gasoline vehicles. (…) On average, electric vehicles driven in metropolitan areas generate benefits of about $0.01 per mile while those driven outside metropolitan areas generate benefits of −1.7 cents per mile.

Faster threshold queries with cache-sensitive scancount

Suppose that you are given 100 sorted arrays of integers. You can compute their union or their intersection. It is a common setup in data indexing: the integers might be unique identifiers.

But there is more than just intersections and unions… What if you want all values that appear in more than three arrays?

A really good algorithm for this problem is called scancount. It is good because it is simple and usually quite fast.

Suppose that all my integers are in the interval [0, 20M). You start with an array of counters, initialized at zero. You scan all your arrays for each value in the array, you increment the corresponding counter. When you have scanned all arrays, you scan your counters, looking for counter values greater than your threshold (3).

The pseudocode looks like this…

counter <- array of zeros
for array x in set of arrays {
    for value y in array x {
      counter[y] += 1
for i = 0; i < counter.size(); i++ {
  if(counter[i] > threshold)
     output i;

This algorithm is almost entirely bounded by “memory accesses”. Memory-wise if you only have about 100 arrays, you only need 8-bit counters. So I can store all counters in about 20 MB. Sadly, this means that the counters do not fit in processor cache.

Can you make scancount faster without sacrificing too much simplicity?

So far we did not use the fact that our arrays can be sorted. Because they are sorted, then you can solve the problem in “cache-sensitive” or “cache-aware” chunks.

Build a small array of counters, spanning maybe only 256 kB. Process all arrays, as with the naive scancount, but suspend the processing of this array as soon as a value in the array exceeds 262144. This allows you to find all matching values in the interval [0, 262144). Next repeat the problem with the next interval ([262144,524288)), and so forth. In this manner, you will have far fewer expensive cache misses.

I implemented this solution in C++. Here are my results using random arrays, GNU GCC 8 and a Skylake processor. I report the number of CPU cycles per value in the arrays.

naive scancount 37 cycles
cache-sensitive scancount 16 cycles

Further reading: Compressed bitmap indexes: beyond unions and intersections, Software: Practice & Experience 46 (2), 2016.

Science and Technology links (August 10th 2019

    1. A short (less than one hour) online lesson improved grades among weaker students. The lesson taught the growth mindset: that intellectual abilities can be developed.
    2. The chip maker AMD, has released a processor (EPYC 7742) which has 32-billion transistors and 64 cores.
    3. According to a recent study, we tend to see women as victims and men as perpetrator. Managers are perceived as less moral and fair when they fire a group of female (versus male) employees. Female victims are expected to experience more pain from an ambiguous joke.
    4. Blue zones are regions in the world where people claim to live longer. Newman questions the validity of these zones in an article entitled Supercentenarians and the oldest-old are concentrated into regions with no birth certificates and short lifespans. The most likely explanation for these zones is that you find more centenarians in regions where it is easiest to lie about one’s own age.
    5. Young men spend more time than ever playing video games. It is also true young men in the West work fewer hours. Yet video games are not to blame for the reduced work.
    6. There is a massive overproduction of coffee beans, leading to falling prices.
    7. In most cases, electric fans are a good way to mitigate heat-related illnesses, despite contrary government advice. Don’t be shy and use a fan when it is too hot! It is safe and it cools you down.
    8. McDonald’s admitted that its paper straws cannot be recycled at this time, contrary to its plastic straws.
    9. Bilingual people are much more resilient to dementia (like Alzheimer’s) than monolingual people, suffering from dementia four years later on average.
    10. In mice, we can stimulate hair growth using prescription drugs rapamycin and metformin. These drugs are often viewed as anti-aging drugs.
    11. A single season of collegiate football (without injury) leads to a reduction in brain integrity.
    12. Playing video games (but not taking piano lessons) increases grey matter volume in the brain.
    13. Building one wind turbine requires 900 tons of steel, 2,500 tons of concrete and 45 tons of plastic. Much of which is produced with coal power. Moreover, most solar panels are built in China using coal power.

Science and Technology links (August 3rd 2019)

JSON parsing: simdjson vs. JSON for Modern C++

JSON is the ubiquitous data format on the Internet. There is a lot of JSON that needs to be parsed and validated.

As we just released the latest version of our fast JSON parser (simdjson), a reader of this blog asked how it compared in speed against one of the nicest C++ JSON libraries: JSON for Modern C++ (nlohmann/json).

We have not reported on benchmarks against “JSON for Modern C++” because we knew that it was not designed for raw speed. It is designed for ease-of-use: it makes the life of the programmer as easy as possible. In contrast, simdjson optimizes for speed, even when it requires a bit more work from the programmer. Nevertheless, it is still interesting to compare speed, to know what your trade-off is.

Let us run some benchmarks… I use a Skylake processor with GNU GCC 8.3.

file simdjson JSON for Modern C++
apache_builds 2.3 GB/s 0.098 GB/s
github_events 2.5 GB/s 0.093 GB/s
gsoc-2018 3.3 GB/s 0.091 GB/s
twitter 2.2 GB/s 0.094 GB/s

Thus, roughly speaking, simdjson can parse, validate a JSON document and create a DOM tree, at a speed of about 2 GB/s. The easier-to-use “JSON for Modern C++” has a speed of about 0.1 GB/s, so about 20x slower. As a reference, we can easily read files from disk or the network at speeds higher than 2 GB/s.

Link: simdjson.

A new release of simdjson: runtime dispatching, 64-bit ARM support and more

JSON is a ubiquitous data exchange format. It is found everywhere on the Internet. To consume JSON, software uses tools called JSON parsers.

Earlier this year, we released the first version of our JSON parsing library, simdjson. It is arguably the fastest standard-compliant parser in the world. It provides full validation. That is, while we try to go as fast as possible, we do not compromise on input validation, and we still process many files at gigabytes per second.

You may wonder why you would care about going fast? But consider that good disks and network chips can provide gigabytes per second of bandwidth. If you can’t process simple files like JSON at comparable speeds, all the bandwidth in the world won’t help you.

So how do we do it? Our trick is to leverage the advanced instructions found on commodity processors such as SIMD instructions. We are not the first ones to go this route, but I think we are the first ones to go so far.

We have published a new  new major release (0.2.0). It brings about many improvements:

  • This first version only ran on recent Intel and AMD processors. We were able to add support for 64-bit ARM processors found on mobile devices. You can run our library on an iPhone. We found that our great speed carries over: we are sometimes close to 2 GB/s.
  • We added support for older PC processors. In fact, we did better: we also introduced runtime dispatching on x64 processors. It means that the software will smartly recognize the feature of your processor and adapt, running the best code paths. We support processors as far back as the Intel Westmere or the AMD Jaguar (PlayStation 4).
  • We now support JSON Pointer queries.

Many people contributed to this new release. It is a true community effort.

This is a lot more coming in the future. Among other things, we expect to go even faster.

Link: GitHub.

Science and Technology links (July 27th 2019)

    1. There are thick ice deposits on the Moon. Water in space is important as it can be used to create fuel and to sustain life.
    2. Most animals do not suffer heart attacks and strokes, the way human beings do. It appears that the difference may be due to a specific genetic mutation. Heart attacks are the leading cause of death in the West. In turn, insulin resistance (diabetes) is the most important single cause of coronary artery disease.
    3. Restricting the time period during which you eat (to say between the morning up til the early afternoon), reduce your appetite and may help you lose weight.

How fast can a BufferedReader read lines in Java?

In an earlier post, I asked how fast the getline function in C++ could run through the lines in a text file. The answer was about 2 GB/s, certainly over 1 GB/s. That is slower than some of the best disk drives and network connections. If you take into account that software rarely only need to “just” access the lines, it is easy to build a system where text-file processing is processor-bound, as opposed to disk or network bound.

What about Java? In Java, the standard way to access lines in a text file is to use a BufferedReader. To avoid system calls, I create a large string containing many lines of text, and then I call a very simple processing function that merely records the length of the strings…

StringReader fr = new StringReader(data);
BufferedReader bf = new BufferedReader(fr);
bf.lines().forEach(s -> parseLine(s));

// elsewhere:
public void parseLine(String s) {
  volume += s.length();

The result is that Java is at least two times slower than C++, on the same system, for this benchmark:

BufferedReader.lines 0.5 GB/s

This is not the best that Java can do: Java can ingest data much faster. However, my results suggest that on modern systems, Java file parsing might be frequently processor-bound, as opposed to system bound. That is, you can buy much better disks and network cards, and your system won’t go any faster. Unless, of course, you have really good Java engineers.

Many firms probably just throw more hardware at the problem.

My source code is available.

Update: An earlier version of the code had a small typo that would create just one giant line. This turns out not to impact the results too much. Some people asked for more technical details. I ran the benchmark on a Skylake processor using GNU GCC 8.1 as a C++ compiler and Java 12, all under Linux. Results will vary depending on your exact configuration.