Science and Technology links (August 17th 2019)

  1. Google may soon release an augmented-reality version of Google Maps for mobile phones. I have always found it difficult to identify the streets I see in an unfamiliar setting, and it looks like Google may be solving this very problem. I am looking forward to seeing how well it works in practice.
  2. The delivery company UPS has been using self-driving trucks for some time. However, the trucks are still manned by a driver and an engineer.
  3. Researchers have rejuvenated the brain cells of old rats. There is a press release.
  4. Blood pressure is an important health marker, and it would be valuable to monitor it more broadly. There are cheap devices to measure blood pressure automatically, but they typically require that you put a strap around your arm. Researchers have shown that we can measure blood pressure using the camera of a smartphone. It remains to be verified that this works outside the lab, in the real world.
  5. Are electric cars good for the environment? According to Holland et al., they are worse than gasoline cars on average, even though they may be preferable where you live:

    Ignoring local pollution leads to an overestimate of the benefits of electric vehicles and an underestimate of the geographic heterogeneity. Accounting for both global and local pollution, we find electric vehicles generate negative environmental benefits of 0.73 cents per mile on average relative to comparable gasoline vehicles. (…) On average, electric vehicles driven in metropolitan areas generate benefits of about $0.01 per mile while those driven outside metropolitan areas generate benefits of −1.7 cents per mile.

Faster threshold queries with cache-sensitive scancount

Suppose that you are given 100 sorted arrays of integers. You can compute their union or their intersection. It is a common setup in data indexing: the integers might be unique identifiers.

But there is more than just intersections and unions… What if you want all values that appear in more than three arrays?

A really good algorithm for this problem is called scancount. It is good because it is simple and usually quite fast.

Suppose that all the integers are in the interval [0, 20M). You start with an array of counters, initialized at zero. You scan all your arrays and, for each value you encounter, you increment the corresponding counter. When you have scanned all the arrays, you scan your counters, looking for counter values greater than your threshold (3).

The pseudocode looks like this…

counter <- array of zeros
for array x in set of arrays {
    for value y in array x {
        counter[y] += 1
    }
}
for i = 0; i < counter.size(); i++ {
    if (counter[i] > threshold)
        output i
}

This algorithm is almost entirely bounded by “memory accesses”. Memory-wise, if you only have about 100 arrays, 8-bit counters suffice. So I can store all the counters in about 20 MB. Sadly, this means that the counters do not fit in processor cache.

Can you make scancount faster without sacrificing too much simplicity?

So far, we have not used the fact that our arrays are sorted. Because they are sorted, you can solve the problem in “cache-sensitive” or “cache-aware” chunks.

Build a small array of counters, spanning maybe only 256 kB. Process all arrays as in the naive scancount, but suspend the processing of an array as soon as you reach a value of 262144 or more. This finds all matching values in the interval [0, 262144). Next, repeat the process with the next interval ([262144, 524288)), and so forth. In this manner, you incur far fewer expensive cache misses.
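
Here is a minimal C++ sketch of this cache-sensitive approach. It is my illustration rather than the exact benchmarked code: the name scancount_chunked and its interface are made up for the example, and the input arrays are assumed to be sorted.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Process the value space in 256 kB chunks so the counters stay in cache.
std::vector<uint32_t> scancount_chunked(
    const std::vector<std::vector<uint32_t>> &arrays,
    uint32_t universe_size, uint8_t threshold) {
  const uint32_t chunk_size = 262144; // counters span 256 kB
  std::vector<uint8_t> counters(chunk_size);
  std::vector<std::size_t> positions(arrays.size(), 0); // resume points
  std::vector<uint32_t> answer;
  for (uint32_t base = 0; base < universe_size; base += chunk_size) {
    std::fill(counters.begin(), counters.end(), 0);
    const uint32_t limit = base + chunk_size;
    for (std::size_t k = 0; k < arrays.size(); k++) {
      const std::vector<uint32_t> &a = arrays[k];
      std::size_t &i = positions[k];
      // the arrays are sorted: suspend as soon as we leave the chunk
      while (i < a.size() && a[i] < limit) {
        counters[a[i] - base]++;
        i++;
      }
    }
    for (uint32_t j = 0; j < chunk_size; j++) {
      if (counters[j] > threshold)
        answer.push_back(base + j);
    }
  }
  return answer;
}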

I implemented this solution in C++. Here are my results using random arrays, GNU GCC 8 and a Skylake processor. I report the number of CPU cycles per value in the arrays.

naive scancount 37 cycles
cache-sensitive scancount 16 cycles

Further reading: Compressed bitmap indexes: beyond unions and intersections, Software: Practice & Experience 46 (2), 2016.

Science and Technology links (August 10th 2019)

    1. A short (less than one hour) online lesson improved grades among weaker students. The lesson taught the growth mindset: that intellectual abilities can be developed.
    2. The chip maker AMD has released a processor (EPYC 7742) which has 32 billion transistors and 64 cores.
    3. According to a recent study, we tend to see women as victims and men as perpetrators. Managers are perceived as less moral and fair when they fire a group of female (versus male) employees. Female victims are expected to experience more pain from an ambiguous joke.
    4. Blue zones are regions in the world where people claim to live longer. Newman questions the validity of these zones in an article entitled Supercentenarians and the oldest-old are concentrated into regions with no birth certificates and short lifespans. The most likely explanation for these zones is that you find more centenarians in regions where it is easiest to lie about one’s own age.
    5. Young men spend more time than ever playing video games. It is also true that young men in the West work fewer hours. Yet video games are not to blame for the reduced work.
    6. There is a massive overproduction of coffee beans, leading to falling prices.
    7. In most cases, electric fans are a good way to mitigate heat-related illnesses, despite contrary government advice. Don’t be shy and use a fan when it is too hot! It is safe and it cools you down.
    8. McDonald’s admitted that its paper straws cannot be recycled at this time, unlike its plastic straws.
    9. Bilingual people are much more resilient to dementia (like Alzheimer’s) than monolingual people, suffering from dementia four years later on average.
    10. In mice, we can stimulate hair growth using the prescription drugs rapamycin and metformin. These drugs are often viewed as anti-aging drugs.
    11. A single season of collegiate football (without injury) leads to a reduction in brain integrity.
    12. Playing video games (but not taking piano lessons) increases grey matter volume in the brain.
    13. Building one wind turbine requires 900 tons of steel, 2,500 tons of concrete and 45 tons of plastic, much of which is produced with coal power. Moreover, most solar panels are built in China using coal power.

JSON parsing: simdjson vs. JSON for Modern C++

JSON is the ubiquitous data format on the Internet. There is a lot of JSON that needs to be parsed and validated.

As we just released the latest version of our fast JSON parser (simdjson), a reader of this blog asked how it compared in speed against one of the nicest C++ JSON libraries: JSON for Modern C++ (nlohmann/json).

We have not reported on benchmarks against “JSON for Modern C++” because we knew that it was not designed for raw speed. It is designed for ease of use: it makes the life of the programmer as easy as possible. In contrast, simdjson optimizes for speed, even when it requires a bit more work from the programmer. Nevertheless, it is still interesting to compare speeds, so that you know what the trade-off is.

Let us run some benchmarks… I use a Skylake processor with GNU GCC 8.3.

file            simdjson    JSON for Modern C++
apache_builds   2.3 GB/s    0.098 GB/s
github_events   2.5 GB/s    0.093 GB/s
gsoc-2018       3.3 GB/s    0.091 GB/s
twitter         2.2 GB/s    0.094 GB/s

Thus, roughly speaking, simdjson can parse and validate a JSON document, and create a DOM tree, at a speed of about 2 GB/s. The easier-to-use “JSON for Modern C++” has a speed of about 0.1 GB/s, so it is about 20x slower. As a reference point, we can easily read files from disk or the network at speeds higher than 2 GB/s.

Link: simdjson.

A new release of simdjson: runtime dispatching, 64-bit ARM support and more

JSON is a ubiquitous data exchange format. It is found everywhere on the Internet. To consume JSON, software uses tools called JSON parsers.

Earlier this year, we released the first version of our JSON parsing library, simdjson. It is arguably the fastest standard-compliant parser in the world. It provides full validation. That is, while we try to go as fast as possible, we do not compromise on input validation, and we still process many files at gigabytes per second.

You may wonder why you would care about going fast. Consider that good disks and network chips can provide gigabytes per second of bandwidth. If you can’t process simple files like JSON at comparable speeds, all the bandwidth in the world won’t help you.

So how do we do it? Our trick is to leverage the advanced instructions found on commodity processors, such as SIMD instructions. We are not the first to go this route, but I think we are the first to go so far.

We have published a new major release (0.2.0). It brings many improvements:

  • The first version only ran on recent Intel and AMD processors. We were able to add support for the 64-bit ARM processors found on mobile devices. You can run our library on an iPhone. We found that our great speed carries over: we are sometimes close to 2 GB/s.
  • We added support for older PC processors. In fact, we did better: we also introduced runtime dispatching on x64 processors. It means that the software smartly recognizes the features of your processor and adapts, running the best code paths (see the sketch after this list). We support processors as far back as the Intel Westmere or the AMD Jaguar (PlayStation 4).
  • We now support JSON Pointer queries.
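
To illustrate runtime dispatching, here is a minimal sketch of the idea. It is not simdjson’s actual dispatching code; it merely shows how a function pointer can be rebound on first call using the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports.

#include <cstddef>

// Generic fallback that runs on any x64 processor.
static int sum_generic(const int *data, std::size_t n) {
  int s = 0;
  for (std::size_t i = 0; i < n; i++) s += data[i];
  return s;
}

// Same routine, compiled so that the compiler may emit AVX2 instructions.
__attribute__((target("avx2")))
static int sum_avx2(const int *data, std::size_t n) {
  int s = 0;
  for (std::size_t i = 0; i < n; i++) s += data[i];
  return s;
}

static int sum_dispatch(const int *data, std::size_t n);

// The pointer starts at the dispatcher and is rebound on the first call.
static int (*sum)(const int *, std::size_t) = sum_dispatch;

static int sum_dispatch(const int *data, std::size_t n) {
  __builtin_cpu_init();
  sum = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_generic;
  return sum(data, n);
}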

Many people contributed to this new release. It is a true community effort.

There is a lot more coming in the future. Among other things, we expect to go even faster.

Link: GitHub.

Science and Technology links (July 27th 2019)

    1. There are thick ice deposits on the Moon. Water in space is important as it can be used to create fuel and to sustain life.
    2. Most animals do not suffer heart attacks and strokes, the way human beings do. It appears that the difference may be due to a specific genetic mutation. Heart attacks are the leading cause of death in the West. In turn, insulin resistance (diabetes) is the most important single cause of coronary artery disease.
    3. Restricting the time period during which you eat (say, from the morning to the early afternoon) reduces your appetite and may help you lose weight.

How fast can a BufferedReader read lines in Java?

In an earlier post, I asked how fast the getline function in C++ could run through the lines in a text file. The answer was about 2 GB/s, certainly over 1 GB/s. That is slower than some of the best disk drives and network connections. If you take into account that software rarely needs to merely access the lines, it is easy to build a system where text-file processing is processor-bound, as opposed to disk-bound or network-bound.
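
For reference, that earlier C++ benchmark boils down to a loop like the following (my sketch, not the exact code from the earlier post):

#include <cstddef>
#include <fstream>
#include <string>

// Sum the line lengths of a text stream using std::getline.
std::size_t sum_line_lengths(std::istream &in) {
  std::size_t volume = 0;
  std::string line;
  while (std::getline(in, line)) {
    volume += line.size();
  }
  return volume;
}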

What about Java? In Java, the standard way to access lines in a text file is to use a BufferedReader. To avoid system calls, I create a large string containing many lines of text, and then I call a very simple processing function that merely records the length of the strings…

import java.io.BufferedReader;
import java.io.StringReader;

StringReader fr = new StringReader(data);
BufferedReader bf = new BufferedReader(fr);
bf.lines().forEach(s -> parseLine(s));

// elsewhere:
public void parseLine(String s) {
  volume += s.length(); // volume is a member variable tallying line lengths
}

The result is that Java is at least two times slower than C++, on the same system, for this benchmark:

BufferedReader.lines 0.5 GB/s

This is not the best that Java can do: Java can ingest data much faster. However, my results suggest that on modern systems, Java file parsing might be frequently processor-bound, as opposed to system bound. That is, you can buy much better disks and network cards, and your system won’t go any faster. Unless, of course, you have really good Java engineers.

Many firms probably just throw more hardware at the problem.

My source code is available.

Update: An earlier version of the code had a small typo that would create just one giant line. This turns out not to impact the results too much. Some people asked for more technical details. I ran the benchmark on a Skylake processor using GNU GCC 8.1 as a C++ compiler and Java 12, all under Linux. Results will vary depending on your exact configuration.

Programming competition with $1000 in prizes: make my code readable!

Colm MacCárthaigh is organizing a programming competition with three prizes: $500, $300 and $200. The objective? Produce the most readable, easy-to-follow, and well-tested implementation of the nearly divisionless random integer algorithm. The scientific reference is Fast Random Integer Generation in an Interval (ACM Transactions on Modeling and Computer Simulation, 2019). I have a few blog posts on the topic, such as Nearly Divisionless Random Integer Generation On Various Systems.
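
As context, here is a C++ rendering of the nearly divisionless technique, following the paper. It is a sketch: random64 stands for any uniform 64-bit random source, and __uint128_t is a GCC/Clang extension.

#include <cstdint>

uint64_t random64(void); // any uniform 64-bit random source

// Returns a uniform integer in [0, range); the division on the slow
// path is almost never executed, hence “nearly divisionless”.
uint64_t random_bounded(uint64_t range) {
  uint64_t x = random64();
  __uint128_t m = (__uint128_t)x * (__uint128_t)range;
  uint64_t l = (uint64_t)m;
  if (l < range) {                 // rarely taken branch
    uint64_t t = -range % range;   // the single division
    while (l < t) {                // rejection removes the bias
      x = random64();
      m = (__uint128_t)x * (__uint128_t)range;
      l = (uint64_t)m;
    }
  }
  return (uint64_t)(m >> 64);
}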

This algorithm has been added to the Swift standard library by Pavol Vaskovic. It has also been added to the Go standard library by Josh Snyder. And it is part of the Python library Numpy thanks to Bernardt Duvenhage and others.

I have a C++ repository on GitHub with relevant experiments as well as a Python extension.

Colm wants the implementation to be licensed under the Apache Software License 2.0. It could be in any programming language you like. The deadline is September 1st 2019. You can find Colm on Twitter.

Arbitrary byte-to-byte maps using ARM NEON?

Modern processors have fast instructions that can operate on wide registers (e.g., 128-bit). ARM processors, the kind of processors found in your phone, have such instructions, called “NEON”. Sebastian Pop pointed me to some of his work doing fast string transformations using NEON instructions. Sebastian has done some great work to accelerate the PHP interpreter on ARM processors. One of his recent optimizations is a way to transform the case of strings quickly.

This suggested the following problem to me. Suppose that you have a stream of bytes. You want to transform the byte values in an arbitrary manner. Maybe you want to map the byte value 1 to the byte value 12, the byte value 2 to the byte value 53… and so forth.

Here is how you might implement such a function in plain C:

for (size_t i = 0; i < volume; i++) {
    values[i] = map[values[i]];
}

For each byte, you need two loads (one for values[i] and one for map[values[i]]) and one store, assuming that the compiler does not do any magic.

To implement such a function on a block of 16 bytes with NEON, we use the vqtbl4q_u8 function, which is essentially a way to do 16 independent look-ups in a 64-byte table. It uses the least significant 6 bits as a look-up index. If any of the other bits are non-zero, it outputs zero. Because there are 256 different values, we need four distinct calls to the vqtbl4q_u8 function. One of them gives non-zero results for byte values in [0,64), another one for byte values in [64,128), another one for byte values in [128,192), and a final one for byte values in [192,256). We select the right range with a bitwise XOR (via the veorq_u8 function): the XOR maps the targeted range to [0,64) while pushing all other values out of it. Finally, we just need to apply bitwise ORs to glue the results back together (via the vorrq_u8 function).

uint8x16_t simd_transform16(uint8x16x4_t * table, uint8x16_t input) {
  uint8x16_t  t1 = vqtbl4q_u8(table[0],  input);
  uint8x16_t  t2 = vqtbl4q_u8(table[1],  
       veorq_u8(input, vdupq_n_u8(0x40)));
  uint8x16_t  t3 = vqtbl4q_u8(table[2],  
       veorq_u8(input, vdupq_n_u8(0x80)));
  uint8x16_t  t4 = vqtbl4q_u8(table[3],  
       veorq_u8(input, vdupq_n_u8(0xc0)));
  return vorrq_u8(vorrq_u8(t1,t2), vorrq_u8(t3,t4));
}

In terms of loads and stores, assuming that you have enough registers, you only need one load and one store per block of 16 bytes.

A more practical scenario might be to assume that all my byte values fit in [0,128), as is the case with a stream of ASCII characters…

uint8x16_t simd_transform16_ascii(uint8x16x4_t * table, 
               uint8x16_t input) {
  uint8x16_t  t1 = vqtbl4q_u8(table[0],  input);
  uint8x16_t  t2 = vqtbl4q_u8(table[1],  
     veorq_u8(input, vdupq_n_u8(0x40)));
  return vorrq_u8(t1,t2);
}

To test it out, I wrote a benchmark which I ran on a Cortex A72 processor. My source code is available. I get a sizeable speed bump when I use NEON with an ASCII input, but the general NEON scenario is slower than a plain C version.

plain C 1.15 ns/byte
neon 1.35 ns/byte
neon (ascii) 0.71 ns/byte

What about Intel and AMD processors? Most of them do not have 64-byte lookup tables: they are limited to 16-byte tables. We need to wait for AVX-512 instructions to get wider vectorized lookup tables. Unfortunately, AVX-512 is only available on some Intel processors, and it is unclear when it will appear on AMD processors.
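
As an aside, here is one way to emulate a 64-byte table with 16-byte shuffles on x86; this is my own sketch, not code from any of the cited work. The _mm_shuffle_epi8 instruction (SSSE3) uses the low 4 bits of each index byte and outputs zero whenever the high bit of the index byte is set, so we need four shuffles plus explicit masking on the high nibble.

#include <tmmintrin.h> // SSSE3

// Map 16 bytes, each in [0, 64), through a 64-entry table
// stored as four 16-byte registers.
__m128i map64(const __m128i table[4], __m128i input) {
  __m128i low = _mm_and_si128(input, _mm_set1_epi8(0x0f));
  __m128i high = _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0f));
  __m128i result = _mm_setzero_si128();
  for (int k = 0; k < 4; k++) {
    // keep only the lanes whose value lies in [16 * k, 16 * (k + 1))
    __m128i mask = _mm_cmpeq_epi8(high, _mm_set1_epi8((char)k));
    __m128i part = _mm_shuffle_epi8(table[k], low);
    result = _mm_or_si128(result, _mm_and_si128(mask, part));
  }
  return result;
}

This takes four shuffles and extra masking per 16 bytes, which is part of why the same trick is less appealing on current x86 processors.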