Science and Technology links (August 17th 2019)

  1. Google may soon release an augmented-reality version of Google Maps for mobile phones. I have always found it difficult to identify the streets I see in an unfamiliar setting, and it looks like Google may be solving this very problem. I am looking forward to seeing how well it works in practice.
  2. The delivery company UPS has been using self-driving trucks for some time. However, the trucks are still manned by a driver and an engineer.
  3. Researchers have rejuvenated the brain cells of old rats. There is a press release.
  4. Blood pressure is an important health marker, and it would be valuable to monitor it more broadly. There are cheap devices to measure blood pressure automatically, but they typically require that you put a strap around your arm. Researchers have shown that we can measure blood pressure using the camera of a smartphone. It remains to be verified that it can work outside of the lab, in the real world.
  5. Are electric cars good for the environment? According to Holland et al., they are worse than gasoline cars on average, even though they may be preferable where you live:

    Ignoring local pollution leads to an overestimate of the benefits of electric vehicles and an underestimate of the geographic heterogeneity. Accounting for both global and local pollution, we find electric vehicles generate negative environmental benefits of 0.73 cents per mile on average relative to comparable gasoline vehicles. (…) On average, electric vehicles driven in metropolitan areas generate benefits of about $0.01 per mile while those driven outside metropolitan areas generate benefits of −1.7 cents per mile.

Faster threshold queries with cache-sensitive scancount

Suppose that you are given 100 sorted arrays of integers. You can compute their union or their intersection. It is a common setup in data indexing: the integers might be unique identifiers.

But there is more than just intersections and unions… What if you want all values that appear in more than three arrays?

A really good algorithm for this problem is called scancount. It is good because it is simple and usually quite fast.

Suppose that all my integers are in the interval [0, 20M). You start with an array of counters, initialized at zero. You scan all your arrays; for each value in an array, you increment the corresponding counter. When you have scanned all arrays, you scan your counters, looking for counter values greater than your threshold (3).

The pseudocode looks like this…

counter <- array of zeros
for array x in set of arrays {
    for value y in array x {
        counter[y] += 1
    }
}
for i = 0; i < counter.size(); i++ {
    if (counter[i] > threshold)
        output i;
}

This algorithm is almost entirely bounded by “memory accesses”. Memory-wise, if you only have about 100 arrays, you only need 8-bit counters, so you can store all counters in about 20 MB. Sadly, this means that the counters do not fit in the processor cache.
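
For concreteness, here is a minimal C++ sketch of the naive approach (my own illustration, not the exact benchmark code). With 8-bit counters, it assumes that there are at most 255 arrays:

#include <cstdint>
#include <vector>

std::vector<uint32_t> scancount(const std::vector<std::vector<uint32_t>> &arrays,
                                uint8_t threshold, uint32_t range) {
  std::vector<uint8_t> counter(range, 0); // one 8-bit counter per possible value
  for (const std::vector<uint32_t> &x : arrays)
    for (uint32_t y : x)
      counter[y]++;
  std::vector<uint32_t> answer;
  for (uint32_t i = 0; i < range; i++)
    if (counter[i] > threshold)
      answer.push_back(i);
  return answer;
}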

Can you make scancount faster without sacrificing too much simplicity?

So far, we have not used the fact that our arrays can be sorted. Because they are sorted, you can solve the problem in “cache-sensitive” or “cache-aware” chunks.

Build a small array of counters, spanning maybe only 256 kB. Process all arrays, as with the naive scancount, but suspend the processing of each array as soon as a value in it reaches 262144. This allows you to find all matching values in the interval [0, 262144). Next, repeat the process with the next interval ([262144, 524288)), and so forth. In this manner, you will have far fewer expensive cache misses.
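
Here is a sketch of this cache-sensitive variant (assumptions mine, not the exact benchmark code); because the arrays are sorted, each array can be suspended at the chunk boundary and resumed later:

#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> cs_scancount(const std::vector<std::vector<uint32_t>> &arrays,
                                   uint8_t threshold, uint32_t range) {
  const uint32_t chunk = 262144; // 256 kB worth of 8-bit counters
  std::vector<uint8_t> counter(chunk);
  std::vector<size_t> position(arrays.size(), 0); // where each array resumes
  std::vector<uint32_t> answer;
  for (uint32_t lo = 0; lo < range; lo += chunk) {
    std::fill(counter.begin(), counter.end(), 0);
    for (size_t k = 0; k < arrays.size(); k++) {
      const std::vector<uint32_t> &x = arrays[k];
      size_t &i = position[k];
      while (i < x.size() && x[i] < lo + chunk) { // suspend at the boundary
        counter[x[i] - lo]++;
        i++;
      }
    }
    for (uint32_t j = 0; j < chunk; j++)
      if (counter[j] > threshold)
        answer.push_back(lo + j);
  }
  return answer;
}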

I implemented this solution in C++. Here are my results using random arrays, GNU GCC 8 and a Skylake processor. I report the number of CPU cycles per value in the arrays.

naive scancount 37 cycles
cache-sensitive scancount 16 cycles

Further reading: Compressed bitmap indexes: beyond unions and intersections, Software: Practice & Experience 46 (2), 2016.

Science and Technology links (August 10th 2019)

    1. A short (less than one hour) online lesson improved grades among weaker students. The lesson taught the growth mindset: that intellectual abilities can be developed.
    2. The chip maker AMD has released a processor (EPYC 7742) which has 32 billion transistors and 64 cores.
    3. According to a recent study, we tend to see women as victims and men as perpetrators. Managers are perceived as less moral and fair when they fire a group of female (versus male) employees. Female victims are expected to experience more pain from an ambiguous joke.
    4. Blue zones are regions in the world where people claim to live longer. Newman questions the validity of these zones in an article entitled Supercentenarians and the oldest-old are concentrated into regions with no birth certificates and short lifespans. The most likely explanation for these zones is that you find more centenarians in regions where it is easiest to lie about one’s own age.
    5. Young men spend more time than ever playing video games. It is also true that young men in the West work fewer hours. Yet video games are not to blame for the reduced work.
    6. There is a massive overproduction of coffee beans, leading to falling prices.
    7. In most cases, electric fans are a good way to mitigate heat-related illnesses, despite contrary government advice. Don’t be shy and use a fan when it is too hot! It is safe and it cools you down.
    8. McDonald’s admitted that its paper straws cannot be recycled at this time, unlike its plastic straws.
    9. Bilingual people are much more resilient to dementia (like Alzheimer’s) than monolingual people, suffering from dementia four years later on average.
    10. In mice, we can stimulate hair growth using prescription drugs rapamycin and metformin. These drugs are often viewed as anti-aging drugs.
    11. A single season of collegiate football (without injury) leads to a reduction in brain integrity.
    12. Playing video games (but not taking piano lessons) increases grey matter volume in the brain.
    13. Building one wind turbine requires 900 tons of steel, 2,500 tons of concrete and 45 tons of plastic, much of which is produced with coal power. Moreover, most solar panels are built in China using coal power.

Science and Technology links (August 3rd 2019)

JSON parsing: simdjson vs. JSON for Modern C++

JSON is the ubiquitous data format on the Internet. There is a lot of JSON that needs to be parsed and validated.

As we just released the latest version of our fast JSON parser (simdjson), a reader of this blog asked how it compared in speed against one of the nicest C++ JSON libraries: JSON for Modern C++ (nlohmann/json).

We have not reported on benchmarks against “JSON for Modern C++” because we knew that it was not designed for raw speed. It is designed for ease-of-use: it makes the life of the programmer as easy as possible. In contrast, simdjson optimizes for speed, even when it requires a bit more work from the programmer. Nevertheless, it is still interesting to compare speed, to know what your trade-off is.

Let us run some benchmarks… I use a Skylake processor with GNU GCC 8.3.

file simdjson JSON for Modern C++
apache_builds 2.3 GB/s 0.098 GB/s
github_events 2.5 GB/s 0.093 GB/s
gsoc-2018 3.3 GB/s 0.091 GB/s
twitter 2.2 GB/s 0.094 GB/s

Thus, roughly speaking, simdjson can parse, validate a JSON document and create a DOM tree, at a speed of about 2 GB/s. The easier-to-use “JSON for Modern C++” has a speed of about 0.1 GB/s, so about 20x slower. As a reference, we can easily read files from disk or the network at speeds higher than 2 GB/s.
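
Conceptually, the benchmark looks like the following sketch (a single-shot timing for brevity; real benchmarks repeat the measurement many times). It assumes the 0.x-era simdjson API (build_parsed_json) as I recall it, which may differ in other releases:

#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include "nlohmann/json.hpp"
#include "simdjson/jsonparser.h"

int main(int argc, char **argv) {
  std::ifstream file(argv[1]);
  std::stringstream buffer;
  buffer << file.rdbuf();
  std::string data = buffer.str();
  double gb = data.size() / (1024.0 * 1024.0 * 1024.0);

  auto t0 = std::chrono::steady_clock::now();
  ParsedJson pj = build_parsed_json(data.data(), data.size()); // parse + validate + DOM
  auto t1 = std::chrono::steady_clock::now();
  nlohmann::json j = nlohmann::json::parse(data);
  auto t2 = std::chrono::steady_clock::now();

  std::cout << "simdjson: "
            << gb / std::chrono::duration<double>(t1 - t0).count() << " GB/s\n";
  std::cout << "JSON for Modern C++: "
            << gb / std::chrono::duration<double>(t2 - t1).count() << " GB/s\n";
  return 0;
}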

Link: simdjson.

A new release of simdjson: runtime dispatching, 64-bit ARM support and more

JSON is a ubiquitous data exchange format. It is found everywhere on the Internet. To consume JSON, software uses tools called JSON parsers.

Earlier this year, we released the first version of our JSON parsing library, simdjson. It is arguably the fastest standard-compliant parser in the world. It provides full validation. That is, while we try to go as fast as possible, we do not compromise on input validation, and we still process many files at gigabytes per second.

You may wonder why you would care about going fast. Consider that good disks and network chips can provide gigabytes per second of bandwidth. If you can’t process simple files like JSON at comparable speeds, all the bandwidth in the world won’t help you.

So how do we do it? Our trick is to leverage the advanced instructions found on commodity processors such as SIMD instructions. We are not the first ones to go this route, but I think we are the first ones to go so far.

We have published a new major release (0.2.0). It brings many improvements:

  • The first version only ran on recent Intel and AMD processors. We were able to add support for 64-bit ARM processors found on mobile devices. You can run our library on an iPhone. We found that our great speed carries over: we are sometimes close to 2 GB/s.
  • We added support for older PC processors. In fact, we did better: we also introduced runtime dispatching on x64 processors. It means that the software will smartly recognize the features of your processor and adapt, running the best code paths (see the sketch after this list). We support processors as far back as the Intel Westmere or the AMD Jaguar (PlayStation 4).
  • We now support JSON Pointer queries.
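
To illustrate the idea behind runtime dispatching (this is a minimal sketch, not simdjson’s actual implementation; parse_avx2 and parse_sse42 are hypothetical stand-ins), you pick the best code path once, based on the features of the processor you are running on:

#include <cstddef>

// Stand-ins for the real kernels compiled for each instruction set.
bool parse_avx2(const char *, size_t) { return true; }  // AVX2 code path
bool parse_sse42(const char *, size_t) { return true; } // Westmere-era code path

using parse_fn = bool (*)(const char *, size_t);

static parse_fn pick_parser() {
  __builtin_cpu_init(); // GCC builtin: query the running CPU
  if (__builtin_cpu_supports("avx2"))
    return parse_avx2;
  return parse_sse42;
}

bool parse(const char *buf, size_t len) {
  static const parse_fn best = pick_parser(); // resolved once, on first call
  return best(buf, len);
}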

Many people contributed to this new release. It is a true community effort.

There is a lot more coming in the future. Among other things, we expect to go even faster.

Link: GitHub.

Science and Technology links (July 27th 2019)

    1. There are thick ice deposits on the Moon. Water in space is important as it can be used to create fuel and to sustain life.
    2. Most animals do not suffer heart attacks and strokes, the way human beings do. It appears that the difference may be due to a specific genetic mutation. Heart attacks are the leading cause of death in the West. In turn, insulin resistance (diabetes) is the most important single cause of coronary artery disease.
    3. Restricting the time period during which you eat (say, from morning until early afternoon) reduces your appetite and may help you lose weight.

How fast can a BufferedReader read lines in Java?

In an earlier post, I asked how fast the getline function in C++ could run through the lines in a text file. The answer was about 2 GB/s, certainly over 1 GB/s. That is slower than some of the best disk drives and network connections. If you take into account that software rarely needs to “just” access the lines, it is easy to build a system where text-file processing is processor-bound, as opposed to disk or network bound.

What about Java? In Java, the standard way to access lines in a text file is to use a BufferedReader. To avoid system calls, I create a large string containing many lines of text, and then I call a very simple processing function that merely records the length of the strings…

StringReader fr = new StringReader(data);
BufferedReader bf = new BufferedReader(fr);
bf.lines().forEach(s -> parseLine(s));

// elsewhere:
long volume; // accumulates the line lengths

public void parseLine(String s) {
  volume += s.length();
}

The result is that Java is at least two times slower than C++, on the same system, for this benchmark:

BufferedReader.lines 0.5 GB/s

This is not the best that Java can do: Java can ingest data much faster. However, my results suggest that on modern systems, Java file parsing might be frequently processor-bound, as opposed to system bound. That is, you can buy much better disks and network cards, and your system won’t go any faster. Unless, of course, you have really good Java engineers.

Many firms probably just throw more hardware at the problem.

My source code is available.

Update: An earlier version of the code had a small typo that would create just one giant line. This turns out not to impact the results too much. Some people asked for more technical details. I ran the benchmark on a Skylake processor using GNU GCC 8.1 as a C++ compiler and Java 12, all under Linux. Results will vary depending on your exact configuration.

Programming competition with $1000 in prizes: make my code readable!

Colm MacCárthaigh is organizing a programming competition with three prizes: $500, $300, $200. The objective? Produce the most readable, easy to follow, and well tested implementation of the nearly divisionless random integer algorithm. The scientific reference is Fast Random Integer Generation in an Interval (ACM Transactions on Modeling and Computer Simulation, 2019). I have a few blog posts on the topic such as Nearly Divisionless Random Integer Generation On Various Systems.
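
For reference, here is a C++ sketch of the nearly divisionless algorithm, adapted from the paper; random64 stands for any uniform 64-bit source (splitmix64 is used here as a stand-in):

#include <cstdint>

static uint64_t rng_state = 12345; // seed of your choice
static uint64_t random64() {       // splitmix64, as a stand-in generator
  uint64_t z = (rng_state += 0x9e3779b97f4a7c15);
  z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
  z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
  return z ^ (z >> 31);
}

// Returns an unbiased random integer in [0, s); a multiplication replaces
// the division in the common case.
uint64_t nearly_divisionless(uint64_t s) {
  uint64_t x = random64();
  __uint128_t m = (__uint128_t)x * (__uint128_t)s;
  uint64_t l = (uint64_t)m;
  if (l < s) {
    uint64_t t = -s % s; // 2^64 mod s, computed with 64-bit arithmetic
    while (l < t) {
      x = random64();
      m = (__uint128_t)x * (__uint128_t)s;
      l = (uint64_t)m;
    }
  }
  return (uint64_t)(m >> 64);
}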

This algorithm has been added to the Swift standard library by Pavol Vaskovic. It has also been added to the Go standard library by Josh Snyder. And it is part of the Python library Numpy thanks to Bernardt Duvenhage and others.

I have a C++ repository on GitHub with relevant experiments as well as a Python extension.

Colm wants the implementation to be licensed under the Apache Software License 2.0. It could be in any programming language you like. The deadline is September 1st 2019. You can find Colm on Twitter.

Arbitrary byte-to-byte maps using ARM NEON?

Modern processors have fast instructions that can operate on wide registers (e.g., 128-bit). ARM processors, the kind of processors found in your phone, have such instructions called “NEON”. Sebastian Pop pointed me to some of his work doing fast string transformations using NEON instructions. Sebastian has done some great work to accelerate the PHP interpreter on ARM processors. One of his recent optimizations is a way to transform the case of strings quickly.

It suggested the following problem to me. Suppose that you have a stream of bytes. You want to transform byte values in an arbitrary manner. Maybe you want to map the byte value 1 to the byte value 12, the byte value 2 to the byte value 53… and so forth.

Here is how you might implement such a function in plain C:

 for(size_t i = 0; i < volume; i++) {
    values[i] = map[values[i]];
 }

For each byte, you need two loads (to get to map[values[i]]) and one store, assuming that the compiler does not do any magic.

To implement such a function on blocks of 16 bytes with NEON, we use the vqtbl4q_u8 function, which is essentially a way to do 16 independent look-ups in a 64-byte table. It uses the least significant 6 bits as a look-up index. If any of the other bits are non-zero, it outputs zero. Because there are 256 different values, we need four distinct calls to the vqtbl4q_u8 function. One of them will give non-zero results for byte values in [0,64), another one for byte values in [64,128), another one for byte values in [128,192), and a final one for byte values in [192,256). To make this work, we shift each range down to [0,64) with a bitwise XOR (the veorq_u8 function). Finally, we just need to apply bitwise ORs to glue the results back together (via the vorrq_u8 function).

uint8x16_t simd_transform16(uint8x16x4_t * table, uint8x16_t input) {
  uint8x16_t  t1 = vqtbl4q_u8(table[0],  input);
  uint8x16_t  t2 = vqtbl4q_u8(table[1],  
       veorq_u8(input, vdupq_n_u8(0x40)));
  uint8x16_t  t3 = vqtbl4q_u8(table[2],  
       veorq_u8(input, vdupq_n_u8(0x80)));
  uint8x16_t  t4 = vqtbl4q_u8(table[3],  
       veorq_u8(input, vdupq_n_u8(0xc0)));
  return vorrq_u8(vorrq_u8(t1,t2), vorrq_u8(t3,t4));
}

In terms of loads and stores, assuming that you have enough registers, you only have one load and one store per block of 16 bytes.
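
To use simd_transform16, you must first load the 256-byte map into four 64-byte NEON tables. Here is a hedged sketch of the setup and driver loop (mine, not Sebastian’s code); it relies on the simd_transform16 function above:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void simd_map(uint8_t *values, size_t volume, const uint8_t map[256]) {
  // Load the 256-byte map into four 64-byte tables.
  uint8x16x4_t table[4];
  for (int i = 0; i < 4; i++) {
    table[i].val[0] = vld1q_u8(map + 64 * i);
    table[i].val[1] = vld1q_u8(map + 64 * i + 16);
    table[i].val[2] = vld1q_u8(map + 64 * i + 32);
    table[i].val[3] = vld1q_u8(map + 64 * i + 48);
  }
  size_t i = 0;
  for (; i + 16 <= volume; i += 16) { // transform 16 bytes at a time
    vst1q_u8(values + i, simd_transform16(table, vld1q_u8(values + i)));
  }
  for (; i < volume; i++) { // leftover bytes: fall back to the scalar loop
    values[i] = map[values[i]];
  }
}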

A more practical scenario might be to assume that all my byte values fit in [0,128), as is the case with a stream of ASCII characters…

uint8x16_t simd_transform16_ascii(uint8x16x4_t * table, 
               uint8x16_t input) {
  uint8x16_t  t1 = vqtbl4q_u8(table[0],  input);
  uint8x16_t  t2 = vqtbl4q_u8(table[1],  
     veorq_u8(input, vdupq_n_u8(0x40)));
  return vorrq_u8(t1,t2);
}

To test it out, I wrote a benchmark which I ran on a Cortex A72 processor. My source code is available. I get a sizeable speed bump when I use NEON with an ASCII input, but the general NEON scenario is slower than a plain C version.

plain C 1.15 ns/byte
neon 1.35 ns/byte
neon (ascii) 0.71 ns/byte

What about Intel and AMD processors? Most of them do not have 64-byte lookup tables. They are limited to 16-byte tables. We need to wait for AVX-512 instructions for wider vectorized lookup tables. Unfortunately, AVX-512 is only available on some Intel processors and it is unclear when it will appear on AMD processors.

Science and Technology links (July 20th 2019)

  1. Researchers solve the Rubik’s cube puzzle using machine learning (deep learning).
  2. There has been a rise in the popularity of “deep learning” following some major breakthroughs in tasks like image recognition. Yet, at least as far as recommender systems are concerned, there are reasons to be skeptical of the good results being reported:

    In this work, we report the results of a systematic analysis of algorithmic proposals for top-n recommendation tasks. Specifically, we considered 18 algorithms that were presented at top-level research conferences in the last years. Only 7 of them could be reproduced with reasonable effort. For these methods, it however turned out that 6 of them can often be outperformed with comparably simple heuristic methods, e.g., based on nearest-neighbor or graph-based techniques. The remaining one clearly outperformed the baselines but did not consistently outperform a well-tuned non-neural linear ranking method. Overall, our work sheds light on a number of potential problems in today’s machine learning scholarship and calls for improved scientific practices in this area.

  3. Your blood contains about four grams of glucose/sugar (it is a tiny amount).
  4. Our brain oscillates at a rate of about 40 Hz (40 times per second). Some researchers may have found the cells responsible for coordinating these waves.
  5. Are we suffering from record-setting heat waves? A recent American government report concludes that we are not:

    (…) the warmest daily temperature of the year increased in some parts of the West over the past century, but there were decreases in almost all locations east of the Rocky Mountains. In fact, all eastern regions experienced a net decrease (…), most notably the Midwest (about 2.2°F) and the Southeast (roughly 1.5°F). (…) As with warm daily temperatures, heat wave magnitude reached a maximum in the 1930s.

    The same report observes that cold extremes have become less common, however.

Science and Technology links (July 13th 2019)

  1. Drinking juice increases your risk of having cancer.
  2. College completion rates are increasing and this may have to do with lower standards.
  3. Australia is going to try a flu vaccine that was partially designed with help from an artificial intelligence called Sam. Sam acquires knowledge and has new ideas, according to its creators. I think that this is misleading wording. What is real, however, is that we routinely underutilize software: many designs that are made entirely by human beings could be greatly improved via software. We are nowhere close to the end of the information technology revolution.
  4. Videoconferencing is now common. We know that it does not work as well as it should in part because we keep looking at the screen, not at the camera, and this breaks eye contact… an important behavior for human beings. Apple is deploying a fix for this problem in its own videoconferencing software by slightly altering the image to make it appear as if you were looking straight at the camera. It should be available publicly in a few months. Hopefully, other videoconferencing software will follow the lead.
  5. Omega 3 supplements lower risk of heart attacks. Calcium with vitamin D increases the risk of strokes. Multivitamins and antioxidants have no significant effect on mortality.
  6. People who live in cities are not less civil than people who live in the country. I think that if you live in a city, you have more social interactions and are therefore more likely to have rude encounters per time period. However, people in a big city are just as nice as people in a village, on average.
  7. Women with children who drink alcohol and use illegal substances are more likely to be in a relationship.
  8. Researchers who study obesity frequently (50% of the time) confuse correlation or association with causality.
  9. Though the United States is often thought to be a leader in research, in a comparison between the leading economies (G20), the researchers found that the output and impact of American researchers are in decline and that the output per researcher has fallen below the G20 average.
  10. Most new drugs entering the market are useless. I think that this should be expected: progress is hard and most of our attempts fail, even if we can only recognize the failures much later. In my mind, we should not worry that too many useless drugs are marketed, we should fear that companies shy away from risky new ventures. Don’t blame the innovators when they take risks and fail, encourage them to fail more often, to take bigger risks.
  11. Though our brains make new brain cells all the time, this effect (neurogenesis) can be slowed or prevented by our own immune system.

Parsing JSON quickly on tiny chips (ARM Cortex-A72 edition)

I own an inexpensive card-sized ROCKPro64 computer ($60). It has ARM Cortex-A72 processors, the same processors you find in the recently released Raspberry Pi 4. These are processors similar to those you find in your smartphone, although they are far less powerful than the best Qualcomm and Apple have to offer.

A common task in data science is to parse JSON documents. We wrote a fast JSON parser in C++ called simdjson; to go fast, it uses fancy (‘SIMD‘) instructions. I wondered how well it would fare on the ARM Cortex-A72 processor.

I am comparing against two top-of-the-line fast parsers (RapidJSON and sajson). I cannot stress enough how good these parsers are: out of dozens of JSON parsers, they clearly stand out. I think it is fair to say that RapidJSON is the ‘standard’ parser in C++. Of course, our parser (simdjson) is good too.

Here are the speeds compared on some of our standard documents; I use GNU GCC 8.3.

file simdjson RapidJSON sajson*
twitter 0.39 GB/s 0.14 GB/s 0.27 GB/s
update-center 0.35 GB/s 0.14 GB/s 0.27 GB/s
github_events 0.39 GB/s 0.20 GB/s 0.29 GB/s
gsoc-2018 0.52 GB/s 0.20 GB/s 0.36 GB/s

So it is interesting to note that code optimized with fancy ‘SIMD‘ instructions runs fast even on small ARM processors.

Yet these speeds are far below what is possible on better processors. Our simdjson parser on an Intel processor can easily achieve speeds well beyond 2 GB/s. There are at least three reasons for the vast difference:

  1. Intel processors have ‘faster’ instructions (AVX2) that can process 256-bit registers. There is no equivalent on ARM. In our case, this means that the Intel processor only needs about 75% as many instructions as the ARM processor.
  2. Intel processors run at a higher clock. It is easy to find an Intel processor that reaches 3.5 GHz or more, maybe much more. My ROCKPro64 runs at 1.8 GHz, so roughly half the speed.
  3. Intel processors easily retire 3 instructions per cycle on a task like JSON parsing whereas the ARM Cortex-A72 is capable of about half of that.

The last two points alone explain about a 4x difference in favour of Intel processors.

To reproduce the results, using Docker, you can do…

git clone https://github.com/lemire/simdjson.git
cd simdjson
docker build -t simdjson . 
docker run --privileged -t simdjson

Credit: Geoff Langdale with contributions from Yibo Cai and Io Daza Dillon.


*- sajson does not do complete UTF-8 validation unlike the other two parsers.


Parsing JSON using SIMD instructions on the Apple A12 processor

Most modern processors have “SIMD instructions“. These instructions operate over wide registers, doing many operations at once. For example, you can easily subtract 16 values from 16 other values in a single SIMD instruction. It is a form of parallelism that can drastically improve the performance of some applications like machine learning, image processing or numerical simulations.
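
For instance, with SSE2 intrinsics on an Intel processor (my illustration, not code from the library):

#include <emmintrin.h> // SSE2 intrinsics

// Subtract 16 byte values from 16 others in a single instruction.
__m128i subtract16(__m128i a, __m128i b) {
  return _mm_sub_epi8(a, b);
}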

A common data format on the Internet is JSON. It is a ubiquitous format for exchanging data between computers. If you are using some data-intensive web or mobile application written in this decade, it almost surely relies on JSON, at least in part.

Parsing JSON can become a bottleneck. The simdjson C++ library applies SIMD instructions to the problem of parsing JSON data. It works pretty well on modern Intel processors, the kind you have in your laptop if you bought it in the last couple of years. These processors have wide registers (256-bit) and a corresponding powerful set of instructions (AVX2).

It is not the end of the road: the upcoming generation of Intel processors supports AVX-512 with its 512-bit registers. But such fancy processors are still uncommon, even though they will surely be everywhere in a few short years.

But what about processors that do not have such fancy SIMD instructions? The processor in your iPhone (an Apple A12) is an ARM processor and it “merely” has 128-bit registers, so half the width of current Intel processors and a quarter of the width of upcoming Intel processors.

It would not be fair to compare an ARM processor with its 128-bit registers to an Intel processor with 256-bit registers… but we can even the odds somewhat by disabling the AVX2 instructions on the Intel processor, forcing it to rely only on smaller 128-bit registers (and the old SSE instruction sets).

Another source of concern is that mobile processors run at a lower clock frequency. You cannot easily compensate for differences in clock frequencies.

So let us run some code and look at a table of results! I make available the source code necessary to build an iOS app to test the JSON parsing speed. If you follow my instructions, you should be able to reproduce my results. To run simdjson on an Intel processor, you can use the tools provided with the simdjson library. I rely on GNU GCC 8.3.

file AVX (Intel Skylake 3.7 GHz) SSE (Intel Skylake 3.7 GHz) ARM (Apple A12 2.5 GHz)
gsoc-2018 3.2 GB/s 1.9 GB/s 1.7 GB/s
twitter 2.2 GB/s 1.4 GB/s 1.3 GB/s
github_events 2.4 GB/s 1.6 GB/s 1.2 GB/s
update-center 1.9 GB/s 1.3 GB/s 1.1 GB/s

So we find that the Apple A12 processor is somewhere between an Intel Skylake processor with AVX disabled and a full Intel Skylake processor, if you account for the fact that it runs at a lower frequency.

Having wider registers and more powerful instructions is an asset: no matter how you look at the numbers, AVX instructions are more powerful than ARM SIMD instructions. Once Intel makes AVX-512 widely available, it may leave many ARM processors in the dust as far as highly optimized codebases are concerned. In theory, ARM processors are supposed to get more modern SIMD instructions (e.g., via the Scalable Vector Extension), but we are not getting our hands on them yet. It would be interesting to know whether Qualcomm and Apple have plans to step up their game.

Note: There are many other differences between the two tests (Intel vs. Apple). Among them is the compiler. The SSE and NEON implementations have not been optimized. For example, I merely ran the code on the Apple A12 without even trying to make it run faster. We just checked that the implementations are correct and pass all our tests.

Credit: The simdjson port to 128-bit registers for Intel processors (SSE4.2) was mostly done by Sunny Gleason, the ARM port was mostly done by Geoff Langdale with code from Yibo Cai. Io Daza Dillon completed the ports, did the integration work and testing.

Science and Technology links (July 6th, 2019)

  1. Jim Keller, the vice president of silicon engineering at Intel, is optimistic regarding the continued exponential progress in computing: “It’s going to keep going, Moore’s law is relentless.”
  2. England did away with free college. The result?

    England’s shift has resulted in increased funding per head and rising enrolments, with no apparent widening of the participation gap between advantaged and disadvantaged students.

    The authors make available a short summary of their paper.

  3. The Japanese have significantly lower risks of cardiovascular disease. Yet the difference is not explained by normal risk factors such as cholesterol (which is higher in Japan) or blood pressure (also higher in Japan).
  4. A protein called eNAMPT has a rejuvenating effect on mice when injected in the blood. Older mice and humans have lower levels of this protein compared to the young.
  5. Music training does not improve academic outcomes.
  6. We can grow functional hair out of stem cells.
  7. Even a little bit of exercise appears to be enough to promote learning and make the brain healthier. (Source: Nature, tested in mice)
  8. Future computers might benefit from persistent memory that is both fast and requires very little power. (Source: Nature)
  9. We sent a spaceship that runs on a solar sail. A solar sail is, as the name suggests, a sail pushed by sunlight.
  10. The Bitcoin cryptocurrency uses about 0.2% of the world’s electricity consumption (46 TWh).
  11. China is producing over 25% of the world’s factory output. The USA is second at 17%, Japan is at less than half the USA while India is a mere 3%.
  12. A Russian scientist is demanding the right to do “germline gene editing” to cure deafness in some families. Germline editing is drastic in the sense that it changes the genome of children and all their descendants.

A fast 16-bit random number generator?

In software, we often need to generate random numbers. Commonly, we use pseudo-random number generators.

A simple generator is wyhash. It is based on multiplications followed by XORs:

uint64_t wyhash64_x;

uint64_t wyhash64() {
  wyhash64_x += 0x60bee2bee120fc15;
  __uint128_t tmp;
  tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
  uint64_t m1 = (tmp >> 64) ^ tmp;
  tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;
  uint64_t m2 = (tmp >> 64) ^ tmp;
  return m2;
}

It generates 64-bit numbers. I was recently asked whether we could build the equivalent at a smaller scale.

What if you only wanted 16-bit integers and you wanted to follow the model of the wyhash64 function? This might be useful if you are working with a limited processor and you only had modest needs. Here is my proposal:

uint16_t wyhash16_x; 

uint32_t hash16(uint32_t input, uint32_t key) {
  uint32_t hash = input * key;
  return ((hash >> 16) ^ hash) & 0xFFFF;
}

uint16_t wyhash16() {
  wyhash16_x += 0xfc15;
  return hash16(wyhash16_x, 0x2ab);
}

To use this code, you should first initialize wyhash16_x to a value of your choice. You may also be interested in replacing the increment (0xfc15) by any other odd integer. The period of this generator is 65536, meaning that after 65536 values, it comes back full circle.
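
For example, a minimal usage sketch, relying on the wyhash16_x and wyhash16 definitions above (the seed is arbitrary):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  wyhash16_x = 42; // seed of your choice
  for (int i = 0; i < 4; i++)
    printf("%u\n", (unsigned)wyhash16());
  return 0;
}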

The essential idea is in the hash16 function. For a given key value, we have a map from 16-bit values to 16-bit values. The map is not invertible, in general. I picked the key value 0x2ab. This choice “maximizes” the avalanche effect. That is, if you take a given value, hash it with hash16, then modify a single bit, and hash it again, the difference between the two outputs should have about 8 bits flipped (out of 16). That is, changing a little bit the input value should change the output a lot.

The hash16 function is not invertible… to make it invertible I would need to make the avalanche effect quite weak. With the key 0x2ab, its image (set of output values) is only of size 44,114 whereas there are 65536 distinct 16-bit values. Is that bad? Well. If you generate 65536 random values in [0, 65536), how many distinct values do you expect? The answer is about 41,427. So, if anything, my hash16 function might have too large an image.

If you only need to generate a few thousand 16-bit values, this simple generator might be good enough. It is also going to be fast if you have a fast 32-bit multiplier. Evidently, if you are planning to do serious number crunching, it will quickly show its limits. I did not even try to push it through standard statistical tests as none of them are designed for such small generators.

What I found interesting is that by optimizing the avalanche effect, I also got a generator that had a decent image size.

The source code I used to analyze this problem is available.

Appendix. Furthermore, you can generate integer values in a range [0,s) efficiently:

uint16_t rand_range16(const uint16_t s) {
    uint16_t x = wyhash16();
    uint32_t m = (uint32_t)x * (uint32_t)s;
    uint16_t l = (uint16_t)m;
    if (l < s) {
        // 2^16 mod s; the cast avoids integer-promotion surprises with -s
        uint16_t t = (uint16_t)(-s) % s;
        while (l < t) {
            x = wyhash16();
            m = (uint32_t)x * (uint32_t)s;
            l = (uint16_t)m;
        }
    }
    return m >> 16;
}

Science and Technology links (June 29th 2019)

  1. Consuming pornography has no effect on sexual desire for one’s partner.
  2. Exercise may increase the likelihood that a woman will become pregnant.
  3. Consuming more protein makes it less likely that you will become frail as you age.
  4. There is no correlation between how often university professors fly and their academic success. Furthermore, academics who study environmental issues like global warming fly just as much as the other professors.
  5. In mammals, the heart does not regenerate. Dead heart cells are dead and no replacement is produced. New research shows that this can be changed, in mice, by pumping them up with a small molecule called miR-294. The treated mice saw good heart regeneration.
  6. In human beings, an important part of the immune system, the thymus, simply disappears over time, leaving older people with a weaker immune system. In mice, this can be reversed using transgenic exosomes.

Bounding the cost of the intersection between a small array and a large array

Consider the scenario where you are given a small sorted array of integers (e.g., [1,10,100]) and a large sorted array ([1,2,13,51,…]). You want to compute the intersection between these two arrays.

A simple approach would be to take each value in the small array and do a binary search in the large array. Let the size of the large array be n, then a binary search takes up to log n comparisons and data accesses. Of course, log n  may not be an integer, so I need to round it up to the smallest integer at least as large as log n: ceil( log n ). If the small array contains k elements, then I need a total of k ceil( log n ) comparisons. Each comparison involves one data access in the large array.
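
In C++, the baseline approach might look as follows (a minimal sketch, not tuned code):

#include <algorithm>
#include <cstdint>
#include <vector>

// One binary search per key: k searches of up to ceil(log n) steps each.
std::vector<uint32_t> intersect(const std::vector<uint32_t> &small_array,
                                const std::vector<uint32_t> &large_array) {
  std::vector<uint32_t> answer;
  for (uint32_t key : small_array)
    if (std::binary_search(large_array.begin(), large_array.end(), key))
      answer.push_back(key);
  return answer;
}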

Is k log n comparisons the best bound? Clearly not: if k is almost as big as n, then k log n can far exceed k+n. Yet a standard merge algorithm (as in the Merge Sort algorithm) requires as little as k+n comparisons. So for the bound to be reasonable, we must require that k is much smaller than n.

Even so, intuitively, it is wasteful to do k distinct binary searches. If the values in the small array are in sorted order, then you have new knowledge after doing the first binary search that you can reuse to help in the next binary search.

How few comparisons can you do? We can use an information-theoretical argument. The basic idea behind such an argument is that for an algorithm based on a decision tree to generate M distinct possible outputs, it must involve at least ceil( log( M ) ) decisions in the worst case, a count corresponding to the height of the corresponding balanced tree.

To be clear, this means that if an adversary gets to pick the input data, and you have a divine algorithm, then I can give you a lower bound on the number of decisions that your algorithm takes. It is entirely possible that, in practice, you will do better given some data; it is also entirely possible that you could do much worse, evidently. Nevertheless, let us continue.

We can recast the intersection problem as the problem of locating, for each key in the small array, the largest value in the large array that is smaller than or equal to the key. How many possible outputs (of this algorithm) are there? I have to choose k values from a set of n elements, but the order does not matter: that’s roughly n^k/k! possibilities. This means that if my algorithm relies on a (balanced) decision tree, it must involve ceil( k log n – log k! ) decisions.

Let us put some numbers in there. Suppose that n is 4 million and k is 1024. Then ceil( log n ) is 22, so each key in the small array will involve 22 decisions. On a per-key basis, our tighter model gives something like log n – (log k!)/k. The log of the factorial is almost equal to k log k by Stirling’s approximation, so we can approximate the result as log n – log k.

Let us plug some numbers in these formulas to track the lower bound on the number of decisions per key in the small array:

k      log n    log n – (log k!) / k
8      22       20
16     22       19
64     22       17
1024   22       13

This suggests that you may be able to do quite a bit better than k distinct binary searches.

Often, however, it is not so much the decisions that one cares about as the number of accesses. Can the value 13 in the table (when k is 1024) be taken as a lower bound on the number of accesses? Not as such, because you can access one value from the large array and then reuse it for many comparisons.

To complicate matters, our systems have cache and cache lines. The cache means that repeated accesses to the same value can be much cheaper than accesses to distinct values and may even be free (if the value remains in a register). And our cache does not operate on a per-value basis, but rather data is loaded in chunks of, say, 64 bytes (a cache line).

All in all, my derivations so far may not be highly useful in practical applications, except maybe to argue that we ought to be able to beat the multiple binary search approach.

Can we sketch such an approach? Suppose that we start by sampling B different values in the big array, at intervals of size n / ( B + 1 ). And then we issue the same k binary searches, but each targeting one of the intervals of size n / ( B + 1 ). This will incur B + k log ( n / ( B + 1 ) ) accesses in the large array. Let us pick B so as to minimize the number of accesses. We could pick B to be k / ln (2) – 1. We get k / ln (2) – 1 + k log ( n ln (2) / k ) accesses in the large array. For simplicity, I am ignoring the fact that B must be an integer. I am also ignoring the cost of matching each value in the small array to the right subinterval. However, if I am only interested in the number of accesses in the large array, it is an irrelevant cost. Even so, under the assumption that k is small, this is a small cost (O(k)).
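
Here is a C++ sketch of this sampled variant (assumptions mine): B evenly spaced values from the large array act as a top-level index, and each key is binary-searched only inside one subinterval of size about n / ( B + 1 ):

#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> intersect_sampled(const std::vector<uint32_t> &small_array,
                                        const std::vector<uint32_t> &large_array,
                                        size_t B) {
  size_t n = large_array.size();
  size_t width = n / (B + 1) + 1; // subinterval length
  std::vector<uint32_t> samples;  // boundary values, one per subinterval
  for (size_t i = width; i < n; i += width)
    samples.push_back(large_array[i]);
  std::vector<uint32_t> answer;
  for (uint32_t key : small_array) {
    // locate the right subinterval via the samples
    size_t j = std::upper_bound(samples.begin(), samples.end(), key)
               - samples.begin();
    auto lo = large_array.begin() + j * width;
    auto hi = large_array.begin() + std::min((j + 1) * width, n);
    if (std::binary_search(lo, hi, key)) // search only within the subinterval
      answer.push_back(key);
  }
  return answer;
}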

Is k / ln (2) + k log (n ln (2) / k) accesses in the large array much better than the naive approach doing k log n accesses? It is unclear to me how much better it might be on real systems once caching effects are accounted for.

Credit:  Travis Downs for the initial questions and many ideas. The mistakes are mine.

Further reading: SIMD Compression and the Intersection of Sorted Integers, Software: Practice and Experience 46 (6), 2016

Science and Technology links (June 22nd 2019)

    1. An implanted chip can improve long-term memory. It is currently impractical, but a start-up company will try to bring this technology to market.
    2. Never before in the history of drug research have so many simultaneous and distinct new approaches been in development.
    3. Once more, we have evidence that exercise can rejuvenate your muscles.
    4. After controlling for sun exposure and other factors, sunscreen neither causes nor prevents skin cancers. Sun exposure itself is associated with fewer cancers and better cardiovascular health. Of course, increased sun exposure is also associated with an increased rate of skin cancers. Yet the association is relatively modest, e.g., living all day under the sun probably does not double your risk of skin cancers (while it has other strong benefits). Note that there are other benefits to sunscreen than cancer prevention.
    5. Fast food is most popular among the upper-middle income brackets.

How fast is getline in C++?

A standard way to read a text file in C++ is to call the getline function. To iterate over all lines in a file and sum up their lengths, you might do as follows:

std::string line; // 'is' is the input stream
size_t x = 0;     // accumulates the line lengths
while (getline(is, line)) {
    x += line.size();
}

How fast is this?

On a Skylake processor with a recent GNU GCC compiler without any disk access, it runs at about 2 GB/s. That’s slower than the maximal throughput of a good flash drive. These results suggest that reading a text file in C++ could be CPU bound in the sense that buying an even faster disk would not speed up your single-threaded throughput.

nanoseconds per byte 0.46 ns
speed 2.0 GB/s

My code is available.

If you write code that processes the strings generated by the getline function calls, in the worst case, the total time will be the sum of the time required by the getline function plus the time required by your code. In other words, you are unlikely to achieve speeds near 2 GB/s.

In comparison, a software library like simdjson can parse and validate JSON inputs, doing everything from Unicode validation to number parsing, at a speed of over 2 GB/s.

I have not tried to do so, but you can locate lines and iterate over them at much greater speeds than 2 GB/s.
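
For example, one could scan for newline characters directly with memchr, avoiding per-line string copies. Here is a rough sketch (my illustration, untested against this benchmark):

#include <cstddef>
#include <cstring>

// Sum the line lengths by locating newline bytes with memchr; no
// per-line std::string is materialized.
size_t sum_line_lengths(const char *data, size_t length) {
  size_t x = 0;
  const char *p = data;
  const char *end = data + length;
  while (p < end) {
    const char *q = static_cast<const char *>(memchr(p, '\n', end - p));
    if (q == nullptr) q = end; // the last line may lack a newline
    x += (size_t)(q - p);      // line length without the newline
    p = q + 1;
  }
  return x;
}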