Filling large arrays with zeroes quickly in C++

Travis Downs reports that some C++ compilers have trouble filling arrays with values at high speed. To fill an array with some value, C++ programmers typically invoke std::fill. We might assume that, for a task as simple as filling an array with zeroes, the C++ standard library would provide the best possible performance. However, with GNU GCC compilers, that is not the case.

The following line of C++ code gets compiled as a loop that fills each individual byte with zeroes when applying the conventional -O2 optimization flag.

std::fill(p, p + n, 0);

When the array is large, this can be inefficient compared to a highly optimized implementation such as the C function memset.

memset(p, 0, n);

Travis reports that the difference is a factor of thirty. He also explains how to fix the C++ so that it is as fast as the memset function.
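
As I recall, the fix amounts to making the type of the fill value match the element type, so that the standard library can dispatch to memset. Here is a minimal sketch of the idea (the helper name fill_zeroes is mine):

#include <algorithm>
#include <cstddef>

void fill_zeroes(char *p, size_t n) {
  // Passing a char-typed zero (rather than the int literal 0) lets the
  // standard library pick its byte-fill specialization, which calls memset.
  std::fill(p, p + n, char(0));
}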

What about very large arrays? What if you need to write zeroes all over a 2GB array?

memset 30 GB/s
std::fill 1.7 GB/s

Roughly speaking, the memset function is 15 times faster than std::fill in my test. This demonstrates that even when working with large volumes of data, you can easily be limited by an inefficient implementation rather than by your memory bandwidth. In fact, std::fill in this instance cannot even keep up with a good network adaptor or a fast disk. We routinely parse JSON data, and write out DOM trees, at speeds far higher than 1.7 GB/s.

Another lesson I would derive from this example is that writing your code in idiomatic C++ is not sufficient to guarantee good performance. There is a lot of engineering involved in making C++ fast, and though it often works out magically, it can fail too.

My source code is available.

Allocating large blocks of memory: bare-metal C++ speeds

In a previous post, I benchmarked the allocation of large blocks of memory using idiomatic C++. I got a depressing result: the speed could be lower than 2 GB/s. For comparison, the disk in my laptop has greater bandwidth.

Methodologically, I benchmarked the “new” operator in C++ with initialization, using the GNU GCC compiler with the -O2 optimization flag [1].

char *buf = new char[s]();

It turns out that you can do better while sticking with C++. We cannot simply call the new operator without initialization because, in general, doing so does not result in the memory actually being allocated. However, we can allocate the memory and then make sure that we touch every “page” of memory. On modern Intel systems, pages are effectively always at least 4kB in size, so we can touch the memory every 4kB. We might call this approach “new and touch”.

char * buf = new char[size];  // the allocation is "virtual" at this point
for (size_t i = 0; i < size; i += 4096) buf[i] = 0; // touch one byte per 4kB page
buf[size - 1] = 0; // make sure the last page is touched as well

Such a new-and-touch strategy should be close to “bare-metal” memory allocation speeds. So how fast is it? It depends on the page size. By default, most systems rely on small (4kB) pages. Allocating many small pages is expensive. Thankfully, Linux can be configured to use huge pages, transparently, via a feature called “transparent huge pages”. And it makes a huge difference!

Allocating 512MB Setting 512MB to zero
regular pages (4kB) 3.9 GB/s 30 GB/s
transparent huge pages 20 GB/s 30 GB/s

I am using a recent Linux system (Ubuntu 16.04), a Skylake processor and GNU GCC 8.3 with the -O2 optimization flag. My source code is available.

It is still the case that allocating memory on most systems is a non-trivial cost since they rely on small 4kB pages. There are fast disks available on the market that have more than 4 GB/s of bandwidth.

Credit: Thanks to Travis Downs and others for their insights and comments.

 

[1] Downs found that we get far better performance out of the new operator with initialization under GNU GCC with the more aggressive -O3 optimization flag. Performance-wise, it should be close to the “new and touch” approach that I am describing.

How fast can you allocate a large block of memory in C++?

In C++, the most basic memory allocation code is just a call to the new operator:

char *buf = new char[s];

According to a textbook interpretation, we just allocated s bytes [1].

If you benchmark this line of code, you might find that it is almost entirely free on a per-byte basis for large values of s. But that is because we are cheating: the call to the new operator “virtually” allocates the memory, but you may not yet have actual memory that you can use. As you access the memory buffer, the system may then decide to allocate the memory pages (often in blocks of 4096 bytes). Thus the cost of memory allocation can be hidden. The great thing with a virtual allocation is that if you never access the memory, you may never pay a price for it.

If you actually want to measure the memory allocation in C++, then you need to ask the system to give you s bytes of allocated and initialized memory. You can achieve the desired result in C++ by adding parentheses after the call to the new operator:

char *buf = new char[s]();

Then the operating system actually needs to allocate and initialize the memory [2]. It may still cheat in the sense that it may recycle existing blocks of memory or otherwise delay allocation. And I expect that it might do so routinely if the value of s is small. But it gets harder for the system to cheat as s grows larger.
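
To give an idea of the measurement, here is a minimal timing sketch; my actual benchmark (mentioned below) is more careful, but the principle is the same:

#include <chrono>
#include <cstddef>
#include <iostream>

int main() {
  const size_t s = 512 * 1024 * 1024; // 512MB, as in the table below
  auto start = std::chrono::steady_clock::now();
  char *buf = new char[s]();          // allocate and zero-initialize
  auto end = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(end - start).count();
  std::cout << (s / seconds) / 1e9 << " GB/s" << std::endl;
  std::cout << int(buf[0]) << std::endl; // keep buf alive so it is not optimized away
  delete[] buf;
}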

What happens if you allocate hundreds of megabytes in such a manner? The answer depends on the size of the pages. By default, your system probably uses small (4kB) pages. Under Linux, you can enable “transparent huge pages” which dynamically switches to large pages when large blocks of memory are needed. Using larger pages means having to allocate and access fewer pages, so it tends to be cheaper.

In both instances, I get around a couple of gigabytes per second on a recent Linux system (Ubuntu 16.04) running a conventional Skylake processor. For comparison, you can set memory to zero at tens of gigabytes per second and my disk can feed data to the system at more than 2 GB/s. Thus, at least on the system I am currently using, memory allocation is not cheap. My code is available; I use GNU GCC 8.3 with the -O2 optimization flag.

Allocating 512MB Setting 512MB to zero
regular pages (4kB) 1.6 GB/s 30 GB/s
transparent huge pages 2.4 GB/s 30 GB/s

You can do better with different C++ code, see my follow-up post Allocating large blocks of memory: bare-metal C++ speeds.

Further remarks. Of course, you can reuse the allocated memory for greater speed. The memory allocator in my standard library could possibly do this already when I call the new operator followed by the delete operator in a loop. However, you still need to allocate the memory at some point, if only at the beginning of your program. If your program needs to allocate 32 GB of memory, and you can only do so at 1.4 GB/s, then your program will need to spend 23 seconds on memory allocation alone.

Footnotes:

  1. Several readers have asked why I am ignoring C functions like calloc, malloc and mmap. The reason is simple: this post is focusing on idiomatic C++.
  2. You might wonder why the memory needs to be initialized. The reason has to do with security. The operating system can certainly give you direct access to memory pages previously used by another process, but it cannot also pass along that process’s data. Thus it needs to erase the data. To my knowledge, most systems achieve this result by zeroing the new memory.

Science and Technology links (January 11th 2020)

  1. The extra wealth that one acquires by attending college is now estimated to be indistinguishable from zero. The authors control for the parents’ education and the individual’s financial acumen.
  2. A drug approved for use in human beings (Rapamycin) appears to rejuvenate dental health in mice.
  3. Low-fat milk increases the obesity risk of children when compared with high-fat milk.
  4. It appears that type 2 diabetes might be caused by fat. Keeping your abdominal fat in check might be the best way to prevent diabetes (speculative?).

How I teach database design

Most software runs on top of databases. These databases are organized logically, with a schema, that is, a formal description. You have entities (your user), attributes (your user’s name) and relationships between them.

Typical textbook database design comes from an era when it was possible to believe that you knew, at the start of the project, what the data would be like. And you could believe that it would not change. You could certainly believe that your 200-page design document would be updated and maintained, and it would be compulsory reading for computer folks in your company for decades. When building an accounting application, you would query the databases using SQL. All of the data was accessed through the databases. The database was the interface. And the same set of centralized databases running on a single computer would serve the entire operation. And everyone would know what the field “name” meant, exactly, in the table “user”. There would not be mergers between companies, or corporate pivots taking you into a new direction. You could believe that your conceptual and logical principles were independent from physical implementation principles. There would be no need to ever duplicate data for performance.

Quite often, that is not how systems are built today, if they ever were built this way. You organize the functions of your organization in terms of services, and the services communicate among themselves. People who do not work directly on your service do not need to know how the data is organized, or even whether the organization is publicly documented at all. You have to provide them with an interface, and that is all they should need. On your end, you cannot rely on the database engine alone to ensure that the data remains usable. “But my data follows a strict schema” is not an excuse for failure. Your users do not want to know about your data layout; they do not want to have to know.

When organizing your data, you will probably get it wrong. If not today, then next week or next year. Life is fast changing and your knowledge is always incomplete. The waterfall model is naive to the point of being actively harmful. So you want to isolate the data layout from its use as much as you can.

In textbooks, the core principle of database design is often “do not duplicate data”. The logic is sound: if you duplicate data, at some point these values may go out of sync and what do you do then? If I see a programmer duplicating data all over his code, I know he lacks experience. However, it is also the case that deliberate duplication is essential. Sometimes you need duplication for performance and performance matters. Sometimes you need duplication to avoid unwieldy complexity, and managing complexity matters. And you need duplication because your information system runs on more than one processor.

You want to store everything in one giant system so that there is no duplication and everything is consistent? This thinking does not scale. It does not scale with respect to complexity or performance. And it does not even achieve consistency in the real world.

The data you will be receiving and maintaining is not pristine. In fact, it is full of errors and unknowns. Marketing thinks that John Smith’s phone number is one thing whereas accounting thinks that it is something else? You know what? Maybe they are both right or maybe they are both wrong. And it is worse than you think because there are unknown unknowns: nothing in your database can be regarded without suspicion. And the minute you have incomplete or uncertain data, your dream of enforcing the relationships between entities may just become naive.

Does that mean that anything goes and that you don’t need to worry about database design? Quite the opposite: you ought to be agonizing over every decision and making sure that you never lock yourself into a corner. Because you will have duplication and multiple systems, you will have inconsistencies, and you will need to deal with them; a single strategy won’t work for all cases.

So how do I teach database design? I ask students to be critical of the textbook material, and I expose them as much as possible to real-world examples where things are not perfectly neat. Begin by looking under the hood of your favorite open-source project.

Further reading: On Desirable Semantics of Functional Dependencies over Databases with Incomplete Information, Fundamenta Informaticae 158 (2018); Functional dependencies with null markers, Computer Journal 58 (2015); A Call to Arms: Revisiting Database Design, SIGMOD Record 40 (2011).

My Science and Technology review for 2019

I like to end every year with my selection of the most significant science and technology events.

  1. In 2019, you could buy a computer from Apple with 1.5 terabytes of memory. And by memory, I mean fast internal memory (RAM). Of course, it would cost tens of thousands of dollars. You and I are unlikely to have that much memory on our computers in the near future. Yet we are all benefiting from faster disks, making the distinction between disk and memory harder. (Note that it has been possible to buy computers with a terabyte of memory and more from other vendors, but not typically in a mainstream desktop form.)
  2. A team led by Conboy showed that we could rejuvenate multiple organs in old mice merely by giving them Alk5i and oxytocin. Quoting from the paper:

    Alk5i plus OT quickly and robustly enhanced neurogenesis, reduced neuro-inflammation, improved cognitive performance, and rejuvenated livers and muscle in old mice.

    You can purchase oxytocin on the Internet and many people take it already to improve their sleep.

  3. We can tell how old someone is by looking at “epigenetic markers”. That is, from one of your cells, we can look at how your DNA is arranged and expressed, and it tells us your age. In some sense, it defines your “biological age” as opposed to your “chronological age”. As far as we could tell, there was no way to intervene and reverse this age. However, in a study reported in Nature, researchers found such age reversal in a small clinical trial. This age reversal came about accidentally while trying to regrow the “thymus” in older people. The thymus naturally disappears with age, leaving older people with weakened immune systems. The small clinical trial also showed that the thymus was regrown, something that was unexpected.
  4. Apple released a smart watch with an integrated ECG, to monitor your heart conditions.
  5. As we age, we accumulate “senescent cells”. These are old cells that create trouble and refuse to die. We now know how to wipe them out in mice, using Dasatinib and Quercetin. It turns out that the exact same approach works in human beings. At least in mice, clearing senescent cells leads to all sorts of health improvements.
  6. A drone attack wiped out 50% of Saudi Arabia’s oil supply for several days. This is a huge impact given that Saudi Arabia provides about 10% of the oil supply worldwide. It is a testimony of how good drone technology has become.
  7. Surgeons are putting patients in suspended animation (they stopped their biological functions) during surgery, replacing their blood with a cold saline solution.

Science and Technology links (December 21st 2019)

  1. The number of research papers with more than 1000 authors is increasing quickly, and the trend is reaching many fields.
  2. Researchers at Facebook use neural networks to solve fancy algebraic problems that may be out of reach of computer algebra systems.
  3. At least on the short term, wind turbines may contribute more to global warming than equivalent fossil fuel technologies. The warming effect has been empirically measured.
  4. The Bengal tiger population is increasing quickly in India.

Xor Filters: Faster and Smaller Than Bloom Filters

In software, you frequently need to check whether some object is in a set. For example, you might have a list of forbidden Web addresses. As someone enters a new Web address, you may want to check whether it is part of your blacklist. Or maybe you have a large list of already used passwords and you want to check whether a proposed new password is part of this list of compromised passwords.

The standard way to solve this problem is to create some key-value data structure. If you have enough memory, you might create a hash table. Or you might use a good old database.

However, such approaches might use too much memory or be too slow. So you want to use a small data structure that can quickly filter the requests. For example, what if you had a tiny data structure that could reliably tell you that your password is not in the list of compromised passwords?

One way to implement such a filter would be to compute a hash value for each of your objects. A hash value is a random-looking number that you generate from your object, in such a way that the same object always generates the same random-looking number, but other objects are likely to generate different numbers. So the password “Toronto” might map to the hash value 32. You can then store these small numbers in a key-value store, like a hash table. Hence, you do not need to store all of the objects. For example, you might store a 32-bit number for each possible password. If you give me a potential password, I check whether its corresponding 32-bit value is in the list: if it is not, then I tell you that it is safe. So if you give me “Toronto”, I check whether 32 is in my table. Otherwise, I send your request to a larger database for a full lookup. The probability that I send you in vain for a full lookup is called the “false positive probability”.

Though this hash table approach might be practical, you could end up using 4 bytes per value. Maybe this is too much? Bloom filters come to the rescue. A Bloom filter works similarly, except that you compute several hash values from your object. You use these hash values as indexes inside an array of bits. When adding an object to the set, you set all the bits corresponding to the object to one. When you receive a new object and you want to check whether it belongs to the set, you just check whether all of the corresponding bits are set. In practice, you will use far fewer than 4 bytes per value (say 12 bits) and still be able to achieve a false positive rate of less than 1%.
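
To make the textbook idea concrete, here is a minimal Bloom filter sketch in C++ (an illustration of the general technique, not code from our paper; the k indexes are derived from two hash values, a common trick):

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct bloom_filter {
  std::vector<bool> bits;
  size_t k; // number of hash functions
  bloom_filter(size_t nbits, size_t nhashes) : bits(nbits), k(nhashes) {}
  size_t index(const std::string &key, size_t i) const {
    uint64_t h1 = std::hash<std::string>{}(key);
    uint64_t h2 = h1 * 0x9E3779B97F4A7C15ULL | 1; // derived second hash, forced odd
    return (h1 + i * h2) % bits.size();
  }
  void add(const std::string &key) {
    for (size_t i = 0; i < k; i++) bits[index(key, i)] = true;
  }
  bool might_contain(const std::string &key) const {
    for (size_t i = 0; i < k; i++)
      if (!bits[index(key, i)]) return false; // definitely not in the set
    return true; // probably in the set (false positives are possible)
  }
};

With roughly 12 bits per key and about 8 hash functions, such a filter achieves a false positive rate below 1%.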

While the Bloom filter is a textbook algorithm, it has some significant downsides. A major one is that it needs many data accesses and many hash values to check that an object is part of the set. In short, it is not optimally fast.

Can you do better? Yes. Among other alternatives, Fan et al. introduced Cuckoo filters which use less space and are faster than Bloom filters. While implementing a Bloom filter is a relatively simple exercise, Cuckoo filters require a bit more engineering.

Could we do even better while limiting the code to something you can hold in your head?

It turns out that you can with xor filters. We just published a paper called Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters that will appear in the Journal of Experimental Algorithmics.

The following figure gives the number of bits per entry versus the false positive probability. Xor filters offer better accuracy for a given memory budget. With only 9 bits per entry, you can get a false positive probability much less than 1%.

The complete implementation in Go fits in less than 300 lines and I did not even try to be short. In fact, any semi-competent Go coder can make it fit within 200 lines.

We also have an optimized C version that can be easily integrated into your projects since it is a single header. It is larger than 300 lines, but contains different alternatives including an approach with slightly faster construction. I wrote a small demonstration in C with a compromised-password detection problem. The xor filter takes a bit longer to build, but once built, it uses less memory and is about 25% faster.

We also have Java and C++ implementations, as well as Rust, Python and Erlang versions.

An xor filter is meant to be immutable. You build it once, and simply rebuild it when needed. Though you can update a Bloom filter by adding keys to it, doing so means either overallocating the initial filter or sacrificing accuracy. These filters are typically not meant to be used as dynamic data structures (unlike a hash table) since they have a fixed capacity. In the case of cuckoo filters, once you approach the maximum capacity (say within 94%), the insertion of new values may fail, and the solution when it does is to rebuild the whole thing.

Deletions are generally unsafe with these filters even in principle because they track hash values and cannot deal with collisions without access to the object data: if you have two objects mapping to the same hash value, and you have a filter on hash values, it is going to be difficult to delete one without the other. No matter what, you must somehow keep track of what was added in the filter and what was removed, and the filter itself cannot tell you. The same issue is at play when you are building the filter: the filter itself cannot tell you whether you already added an element: you have to keep track of this information yourself.

Furthermore, all these filters are frequently used in a concurrent setting, with multiple threads. You typically cannot safely modify the filter while querying it. The simplest solution to avoid expensive locks is to make it immutable. Keep in mind that these filters are meant to be small.

The construction of the xor filter requires several bytes of temporary memory per entry, but this memory can be released immediately after construction and a compact data structure remains, using just a few bits per entry.

If you do not have stringent requirements regarding memory usage, other techniques might be preferable like blocked Bloom filters which have unbeatable speed (at the expense of higher memory usage). The xor filters are really at their best when you want high speed and tight memory usage.

Credit: This is joint work with Thomas Mueller Graf. Xor filters are basically an efficient implementation and specialization of a theoretical construction, the Bloomier filter.

A look back to my 2010 predictions for 2020

Back in 2010, I wrote a post Who is going to need a database engine in 2020?

Let me revisit some of my 2010 statements.

Apple will sell desktops with 1 TB of RAM in 2020.

I am sure that the prediction sounded insane back in 2010, but it actually happened. A Mac Pro can have up to 1.5TB of RAM right now.

Clever programmers can write much faster specialized engines. Obviously, programmers will need help. They will need great libraries to help with data processing, data compression, and data architecture.

This happened too. The number of custom data processing engines has exploded: Elasticsearch (2010), Apache Spark (2014), ClickHouse (2016), and so forth. The great libraries for data processing have also materialized. In this instance, I put my fingers where my mouth was and invested my own time in this trend, to good effect I would say.

I won’t make predictions except to say that I expect new and more powerful programming languages to replace the existing ones. I’ll be pretty sad if in 2020, I’m still primarily using Python, Java and C++. There is so much innovation out there that something strong has to emerge out of it.

New contenders have emerged: Swift (2014), Go (late 2009), Rust (2010), Julia (2012). I have done some work in several of them and they are a step forward.

I am still mostly programming in Python, Java, C and C++.

I would say that C++ has become a much better language since 2010. It is not any easier, but it has exceeded my expectations.

If anything, Python’s popularity is far higher than it was back in 2010.

So I am mildly sad. I am still doing a lot of C++ and it is really, really hard. However, I could productively switch to shinier programming languages.

Overall, I would say that my 2010 beliefs about the future were accurate, at least as far as this one blog post of mine is concerned.

Science and Technology links (December 14th 2019)

  1. The computing capacity needed by artificial intelligence doubles every 3.4 months. In parallel, we are making fast progress in hardware and software: what took three hours a year and a half ago can now take less than two minutes.
  2. In Luxembourg, a small European country, all public transportation is free. Luxembourg has also a gay prime minister who is married.
  3. Men and women have very different personalities. Knowing only the personality traits of someone, you can tell whether they are a man or a woman with 85% accuracy.
  4. We are deploying drones equipped with machine guns.
  5. Alzheimer’s is a terrible disease and there is currently no proven way to even slow down the disease. We might be very close to approving the first drug which might favorably alter the progression of the disease.
  6. We are approving new medical therapies at record rates.
  7. Many worry that computers and artificial intelligence will destroy so many jobs so quickly that human workers will never recover. Scholl and Hanson find no evidence for the emergence of such a trend.
  8. Crows can use tools and solve multi-step puzzles.
  9. As we age, we accumulate defective non-dividing cells called senescent cells. The older we get the more quickly we accumulate them. The progression is exponential.

Credit: Thanks to Peter Turney for sending me a link to the 2019 AI Index report.

Are 64-bit random identifiers free from collision?

It is common in software systems to map objects to unique identifiers. For example, you might map all web pages on the Internet to a unique identifier.

Often, these identifiers are integers. For example, many people like to use 64-bit integers. If you assign two 64-bit integers at random to distinct objects, the probability of a collision is very, very small. You can be confident that they will not collide.

However, what about the case where you have 300 million objects? Or maybe 7 billion objects? What is the probability that at least two of them collide?

This is just the birthday paradox. Wikipedia gives us an approximation to the collision probability, assuming that the number of objects r is much smaller than the number of possible values N: 1 - exp(-r^2/(2N)). Because there are so many 64-bit integers, it should be a good approximation.
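
As an illustration, the following short C++ program (my own sketch, not the code behind the table) evaluates this approximation with N = 2^64:

#include <cmath>
#include <cstdio>
#include <initializer_list>

// Approximate probability that at least two of r random 64-bit identifiers
// collide: 1 - exp(-r^2 / (2N)) with N = 2^64.
double collision_probability(double r) {
  const double N = 18446744073709551616.0; // 2^64
  return 1.0 - std::exp(-r * r / (2.0 * N));
}

int main() {
  for (double r : {500e6, 1e9, 1.5e9, 2e9, 4e9})
    std::printf("%.0f objects: %.1f%%\n", r, 100.0 * collision_probability(r));
}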

Number of objects Collision probability
500M 0.7%
1B 3%
1.5B 6%
2B 10%
4B 35%

Thus if you have a large system with many objects, it is quite conceivable that your randomly assigned 64-bit identifiers might collide. If a collision is a critical flaw, you probably should not use only 64 bits.

Amazon’s new ARM servers: Graviton 2

Most servers on the Internet run on x64 processors, mostly made by Intel. Meanwhile, most smartphones run ARM processors.

From a business perspective, these are different technologies. The x64 processors are mostly designed by only two companies (Intel and AMD), with one very large dominant player (Intel). In contrast, ARM processors come in many different flavours. Apple, Qualcomm, Samsung and others design their own ARM processors. This diversity can be either a blessing or a curse. The blessing is that you get a lot of direct competition and many custom solutions. The curse is that it is harder to support ARM systems because there are so many variations.

Amazon is the largest public cloud provider and they are large enough to design their own servers and even their own silicon. Some time ago, they launched a service (Graviton) based on their own ARM-based servers. I tested them out, but the performance just was not there. Amazon has just announced a second iteration of these servers called Graviton 2, and they claim a 7-fold performance increase over their previous ARM servers. They are based on processor designs made for servers called Neoverse. I do not yet have access to these servers, but Matthew Wilson from Amazon was kind enough to run my standard JSON parsing benchmark (using simdjson and the twitter.json data file).

Compiling the results in a table suggests that these new Amazon servers have processors that are nearly as good as those in a flagship smartphone.

iPhone XR A12 2.5 GHz 1.3 GB/s
Graviton 2 Neoverse N1 2.5 GHz 1.1 GB/s
Ampere (first generation) Skylark 3.2 GHz 0.37 GB/s
Rockpro64 Cortex-A72 1.8 GHz 0.32 GB/s

My fast JSON parsing benchmark is just something I happen to care about. It is probably not representative of whatever you care about. In particular, it is CPU intensive whereas servers have many other bottlenecks.

Nevertheless, I find these results quite encouraging. If I normalize the speed by the frequency, I get that the new Neoverse N1 processor is 2.5 times faster than the Cortex-A72. When they come out, they may be the fastest publicly accessible ARM servers.

Amazon is claiming that these Graviton 2 servers offer much better performance than Intel-based servers (EC2 M5) and that they will be slightly cheaper. My expectation is that the better performance will be due in large part to a higher number of cores.

Update: Matthew Wilson reports 1.3 GB/s following some optimization work.

Science and Technology links (December 7th 2019)

  1. Incredibly, there is a new, simpler way to solve quadratic equations. I used to rely on completing the square, but this is better! There is a video report on the finding.
  2. Surgeons are putting patients in suspended animation (biostasis) during surgery, replacing their blood with a cold saline solution, for time-sensitive interventions.
  3. Statins are a widely prescribed class of medications. They may adversely affect cognition.

AMD Zen 2 and branch mispredictions

Intel makes some of the very best processors money can buy. For a long time, its main rival (AMD) failed to compete. However, AMD’s latest generation of processors (Zen 2) appears to roughly match Intel, at a lower price point.

In several benchmarks that I care about, my good old Intel Skylake (2015) processor beats my brand-new Zen 2 processor.

To try to narrow it down, I created a fun benchmark. I run the following algorithm, where I generate random integers quickly and then check the two least significant bits. By design, no matter how good the processor is, there should be about one mispredicted branch per loop iteration.

while (true) {
   r = random_integer()           // generate a random integer quickly
   if ((r & 1) == 1) {            // hard-to-predict branch
     write r to output
   }
   if ((r & 2) == 2) {            // another hard-to-predict branch
     write r to output
   }
}

I record the number of CPU cycles per loop iteration. This number is largely independent from processor frequency, memory access and so forth. The main bottleneck in this case is branch misprediction.
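
As a rough illustration of the measurement, here is a sketch that uses the time-stamp counter as a stand-in for CPU cycles; my actual benchmark relies on hardware performance counters, but the idea is the same. The helper name cycles_per_iteration is mine:

#include <cstdint>
#include <x86intrin.h> // __rdtsc

// Run 'trial' once and return the (approximate) number of cycles per iteration.
template <typename F>
double cycles_per_iteration(F trial, uint64_t iterations) {
  uint64_t start = __rdtsc();
  trial();
  uint64_t end = __rdtsc();
  return double(end - start) / double(iterations);
}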

Intel Skylake 29.7 cycles
AMD Zen 2 31.7 cycles

Thus it appears that, in this instance, the AMD Zen 2 has two extra cycles of penalty per mispredicted branch. If you run the same benchmark without the branching, the difference in execution time is about 0.5 cycles per iteration in favour of the Skylake processor. This suggests that the AMD Zen 2 might waste between one and two extra cycles per mispredicted branch.

My code is available. I define a docker container so my results are easy to reproduce.

Instructions per cycle: AMD Zen 2 versus Intel

The performance of a processor is determined by several factors. For example, processors with a higher frequency tend to do more work per unit of time. Physics makes it difficult to produce processors that have higher frequency.

Modern processors can execute many instructions per cycle. Thus a 3.4GHz processor has 3.4 billion cycles per second, but it might easily execute 7 billion instructions per second on a single core.

Up until recently, Intel produced the best commodity processors: its processors had the highest frequencies, the most instructions per cycle, the most powerful instructions and so forth. However, Intel is increasingly being challenged. One smaller company that is making Intel nervous is AMD.

It has been reported that the most recent AMD processors surpass Intel in terms of instructions per cycle. However, it is not clear whether these reports are genuinely based on measures of instructions per cycle. Rather, it appears that they are measures of the amount of work done per unit of time, normalized by processor frequency.

In theory, a processor limited to one instruction per cycle could beat a modern Intel processor on many tasks if they had powerful instructions and faster data access. Thus “work per unit of time normalized per CPU frequency” and “instructions per cycle” are distinct notions.

I have only had access to a recent AMD processor (Zen 2) for a short time, but in this short time, I have routinely found that it has a lower number of instructions per cycle than even older Intel processors.

Let us consider a piece of software that has a high number of instructions per cycle, the fast JSON parser simdjson. I use GNU GCC 8 under Linux and process a test file called twitter.json using the benchmark command-line tool parse. I record the number of instructions per cycle, as measured by CPU counters, in the two stages of processing. The two stages together effectively parse a JSON document. This is an instruction-heavy benchmark: the numbers of mispredicted branches and cache misses are small. The Skylake processor has the highest frequency. I use an AMD Rome (server) processor.

I find that AMD is about 10% to 15% behind Intel.

processor IPC (stage 1) IPC (stage 2)
Intel Skylake (2015) 3.5 3.0
Intel Cannon Lake (2018) 3.6 3.1
Zen 2 (2019) 3.2 2.8

Another problem that I like is bitset decoding. That is, given an array of bits (0s and 1s), I want to find the locations of the ones. See my blog post Really fast bitset decoding for “average” densities. I benchmark just the “basic” decoder.

#include <cstdint>
#include <x86intrin.h> // _tzcnt_u64

// Write out the index (idx + bit position) of each set bit in 'bits'.
void basic_decoder(uint32_t *base_ptr, uint32_t &base,
  uint32_t idx, uint64_t bits) {
  while (bits != 0) {
    base_ptr[base] = idx + _tzcnt_u64(bits); // position of the lowest set bit
    bits = bits & (bits - 1);                // clear the lowest set bit
    base++;
  }
}
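
For context, here is how one might call this decoder on a single 64-bit word (a hypothetical usage example, not part of the benchmark):

uint32_t positions[64];
uint32_t count = 0;
basic_decoder(positions, count, 0, 0b101101ULL);
// positions[0..count) now contains {0, 2, 3, 5}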

My source code is available.

processor IPC
Intel Skylake (2015) 2.1
Zen 2 (2019) 1.4

So AMD runs at 2/3 the IPC of an old Intel processor. That is quite poor!

Of course, your results will vary. And I am quite willing to believe that in many, maybe even most, real-world cases, AMD Zen 2 can do more work per unit of time than the best Intel processors. However, I feel that we should qualify these claims. I do not think AMD customers should simply expect better numbers of instructions per cycle on the tasks that they care about; they may even find lower numbers. AMD Zen 2 does not dominate Intel Skylake; it is more reasonable to expect rough parity.

Further reading: AMD Zen 2 and branch mispredictions

Better computational complexity does not imply better speed

A recent magazine article presents a theoretical result: Harvey and van der Hoeven have shown that you can multiply two n-bit integers with O(n log n) complexity. Roughly speaking, this means that, as n grows large, there is a mathematical model of how long the resulting algorithm might run, and this model grows like n log n in the worst case. That is, it does not get much worse as n gets larger, mathematically speaking. It is likely that this result will go into textbooks as an important contribution to theoretical computer science.

So far so good. But the magazine article then states that the authors have presented a faster way to compute multiplications. That is incorrect. At this point in time, they have not made this demonstration or even this claim, to my knowledge. Possibly they have an implementation somewhere and might be able to answer my questions, but I don’t expect so.

So what is the problem with claiming that they have a faster and more efficient way to do multiplications?

  • Computational complexities like O(n log n) are not a speed or even a time in the real world.
  • They are not even a model of the running time or speed. They are a mathematical asymptotic model for n large. How large must n be for a favorable model to theoretically beat a less favorable model mathematically? Possibly as large as the number of atoms in the universe. Or maybe even larger than that. I am not joking: as far as I can tell, the Harvey and van der Hoeven algorithm would require more digits per integer than there are atoms in the universe to have a chance at being practical. Please pause to consider: if you are a scientist, you need to make predictions that apply to our universe, otherwise you are doing mathematics. Mathematics is a fine activity, but it is not to be confused with science or engineering. I can hear people clamouring that, in the end, big n always wins out. Let me quote a famous theoretical computer scientist on this topic:

    Asymptotics will eventually win out, as long as everything else stays fixed. But that’s the precise problem. Everything else doesn’t stay fixed.

  • But it gets worse: these are not scientific models. A scientific model would predict the running time of the algorithm given some implementation, within some error margin. However, these models do nothing of the sort. They are purely mathematical. They are not falsifiable. As long as they are mathematically correct, then they are always true. To be fair, some researchers like Knuth came up with models that closely mimic reasonable computers, but that’s not what people pursuing computational complexity bounds do, typically.
  • Importantly, this means that these asymptotic models make no prediction about the running time of actual code. If I ask Harvey and van der Hoeven how long (in seconds) their algorithm takes to compute the multiplication of 1024-byte integers on my macbook, they cannot tell me from their paper alone. Thus they do not know whether it is actually faster than the big-integer library I am using (gmplib).
  • Maybe you would argue that, eventually, their algorithm would beat whatever gmplib has implemented, given large enough integers. But that’s not a scientific (i.e., falsifiable) statement. I could implement their algorithm and apply it to 100000-byte integers and find that their approach is no faster. You could then argue that I need to go even larger. If I do and fail again, you might argue that I need to go even larger still. That is not how science or engineering should work. If you have a scientific model, it should make predictions that can be correct or false about the real world. And maybe you end up telling me that if I use 2^1024 bits per integer, then surely the implementation of the new algorithm is faster: but it is again an unscientific statement because there is no such thing as a 2^1024-bit integer and there never will be in this universe.
  • Possibly Harvey and van der Hoeven have more precise bounds than just a big-O result. But even if they make falsifiable predictions about actual use cases, it does not follow that they are correct. They are working from a mathematical model. It is an idealized version of a computer. Without checking, you cannot tell whether your model is correct. You have to test it out. And, importantly, if you make flawed predictions, then your model is incorrect from a scientific point of view. And what are you modelling exactly? If you are assuming that a computer that manages petabytes of memory works the same as a computer that has gigabytes of memory, something is fishy.

I am not just nitpicking. This is of critical importance. Medical researchers cure Alzheimer’s or cancer in marvelous ways almost weekly using “animal models” or in vitro (in the laboratory). These breakthroughs almost never translate in the real world. It is ridiculously hard to go from models to applications.

If you do theoretical work and arrive at a model that suggests that you have a better, faster algorithm, then you are not nearly done. The map is not the territory. If you are good, then your model should closely match reality. But, no matter how smart you are, and no matter how clever your mathematical work is, you can only claim to be solving a problem faster after you have built the solution into a computer and recorded the elapsed time.

I am not diminishing Harvey and van der Hoeven’s accomplishments. If my understanding is correct, their names will be in textbooks for a very, very long time. It is well deserved based on the mathematical construction.

But we are not able to multiply integers faster thanks to their paper. This paper of theirs is not an engineering contribution. To be fair, it may lead to one such contribution. Working with models is often a useful way to do better engineering. However, you have to keep in mind that the model is not reality and that reality is the only thing that matters in the end. If you lose track of this important fact, then you are falling back into prescientific thinking.

What if you disagree with my post? What if you think that Harvey and van der Hoeven’s contribution is indeed a step forward engineering-wise? Then you have a clear path forward: provide an implementation, benchmark it against well established software (e.g., gmplib) and show the benefits. No amount of white board arguments can make this requirement go away: show me the code.

Memory parallelism: AMD Rome versus Intel

When thinking about “parallelism”, most programmers think about having multiple processors. However, even a single core in a modern processor has plenty of parallelism. It can execute many instructions per cycle and, importantly, it can issue multiple memory requests concurrently.

Our processors are becoming “more parallel” over time, as is evident by the recent increases in the number of cores. AMD sells 64-core processors. But each individual core is also becoming “more parallel”.

To demonstrate, let me use a memory access test.

The starting point is a shuffled array. You access one entry in the array, read the result, and it points to the next location in the array. You get a random walk through an array. The test terminates when you have visited every location in the array.

Then you can “parallelize” this problem. Divide the random walk into two equal-size paths. Or divide it into 10 equal-size paths.

If you can issue only one memory request at any one time, parallelizing the problem won’t help. However, processors can issue more than one memory request and so as you parallelize the problem, your running times get smaller and your effective bandwidth higher.
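
To make the test concrete, here is a minimal C++ sketch of the idea; it is an illustration, not the testingmlp package I used for the measurements, and the helper names are mine:

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Build a random cycle: a[i] holds the index of the next element to visit.
std::vector<uint32_t> make_random_cycle(size_t n) {
  std::vector<uint32_t> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::shuffle(order.begin(), order.end(), std::mt19937(1234));
  std::vector<uint32_t> a(n);
  for (size_t i = 0; i + 1 < n; i++) a[order[i]] = order[i + 1];
  a[order[n - 1]] = order[0];
  return a;
}

// Chase 'lanes' independent pointers; with more lanes, the processor can keep
// more memory requests in flight at the same time.
uint64_t chase(const std::vector<uint32_t> &a, size_t lanes, size_t steps) {
  std::vector<uint32_t> pos(lanes);
  for (size_t l = 0; l < lanes; l++)
    pos[l] = uint32_t(l * (a.size() / lanes)); // distinct starting points
  uint64_t sum = 0;
  for (size_t s = 0; s < steps; s++)
    for (size_t l = 0; l < lanes; l++) { pos[l] = a[pos[l]]; sum += pos[l]; }
  return sum; // returned so the compiler cannot discard the loads
}

Timing chase with 1, 2, 4 or more lanes over an array much larger than the cache shows the effective bandwidth rising with the number of lanes, up to the limit of what the processor can sustain.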

How has this evolved over time? The very latest Intel processors (e.g., Cannon Lake) can sustain more than 20 memory requests at any one time. It is about twice what the prior generation (Skylake) could do. How do the latest AMD processors fare? About the same. They can sustain nearly 20 concurrent memory requests at any one time. AMD does not quite scale as well as Intel, but it is close. In these tests, I am hitting RAM: the array is larger than the CPU cache. I am also using huge pages.

The important lesson is that if you are thinking about your computer as a sequential machine, you can be dramatically wrong, even if you are just using one core.

And there are direct consequences. It appears that many technical interviews for engineering positions have to do with linked lists. In a linked list, you access the elements of a list one by one, as the location of the next entry is always coded in the current entry and nowhere else. Obviously, it is a potential problem performance-wise because it makes it hard to exploit memory-level parallelism. And the larger your list and the more recent your processor, the worse it gets.

I make the raw results available. I use the testingmlp software package.

Cloud computing: a story of incentives

Many businesses today run “in the cloud”. What this often means is that they have abstracted out the hardware entirely. Large corporations like Amazon, Google, Microsoft or IBM operate the servers. The business only needs to access the software, remotely.

In theory, this means that you can adjust your capacity for just your needs. If you need only twelve servers most of the year, then you pay for only twelve servers. And on the specific days when you need 100 servers, you pay for the 100 servers on these days only. You may even use “serverless” computing and pay just for what you use, saving even more money.

Is this the whole story?

I am not so sure.

A tremendous benefit of cloud computing for the developers and operation people is that it cuts through the red tape. If you are using the cloud, then a developer can get ten more servers at the click of a button. I have met credible people from well-known businesses who told me that their developers have practically an unlimited ability to allocate new cloud resources.

If we make it easy for developers to quickly use lots of computing resources, these developers might effectively think of computing and storage as infinite. It also wipes away all incentives to produce efficient systems.

You may end up warming up the planet. It is not a joke: training a single machine-learning model can have over four times the footprint of a car over its entire lifetime.

Your ten servers end up needing a much more complicated architecture than your single laptop. But that is not obviously a negative for some developers who get to try out fancier-than-needed software, always useful on a resume.

Developers are not alone. When I take pictures these days, I never delete them. They get uploaded to the cloud and I forget about them. When I managed the scarce storage on my PC where digital pictures could be stored, two decades ago, I would spend a lot of time deleting bad pictures. I no longer care.

Further reading: The computing power needed to train AI is now rising seven times faster than ever before and Facebook AI research’s latest breakthrough in natural language understanding is running up against the limits of existing computing power. (credit: Peter Turney)

Science and Technology links (November 16th 2019)

    1. We have new technology to do genetic engineering on human beings (CRISPR). In a small clinical trial, the researchers tested it on live human subjects and found it to be safe.
    2. Girls and boys have a similar intelligence according to psychometric tests, but girls do much better in school, and this is not due to their better verbal ability.
    3. Being illiterate is associated with dementia and cognitive decline in old age.
    4. Dumber people tend to smoke more cannabis, but cannabis itself does not appear to make people dumber.
    5. We transplanted pig skin onto human beings.
    6. A large fraction of Ph.D. students seek psychological help for anxiety and depression. The students often blame the Ph.D. supervisors.
    7. Researchers have apparently been able to reverse aging in the eyes of old mice. The eyes can then recover from a crushed nerve and from glaucoma.

Unrolling your loops can improve branch prediction

Modern processors predict branches (e.g., if-then clauses), often many cycles ahead of time. When predictions are incorrect, the processor has to start again, an expensive process. Adding a hard-to-predict branch can multiply your running time.

Does it mean that you should only care about hard-to-predict branches? Unfortunately no.

In a prior blog post, I showed how adding a predictable branch can degrade the branch prediction of other branches.

Conversely, removing a predictable branch might improve branch prediction. Wilco Dijkstra suggested a way to test it out. A common form of predictable branch is a loop. Though a loop does not look like an if-then clause, it still compiles to a branch. A loop is predictable if it iterates long enough. You can “unroll” loops, that is, reduce the number of iterations by doing more work with each iteration. Unrolling does more than remove a predictable branch, but removing that branch is one of its effects.

Let us consider an example. I generate random numbers. I check whether they are odd, and when they are, I write the result to an array. It is a silly example meant to illustrate the effect of branch prediction. Because my random numbers are generated by a pseudo-random number generator, I can run many trials with the same random numbers. The more often I repeat the loop, the better the branch prediction gets.

  while (howmany != 0) {
    uint64_t randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }

We can unroll this loop: instead of generating one random number per loop iteration, I generate two. You can also generate four or eight, and so forth. The unrolled loop has fewer branches because each iteration through the loop is an implicit branch.

  uint64_t randomval;
  while (howmany >= 2) { // process two random numbers per loop iteration
    randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    randomval = rng(howmany - 1 + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany -= 2;
  }
  while (howmany != 0) { // handle any leftover iteration
    uint64_t randomval = rng(howmany + seed);
    if ((randomval & 1) == 1)
      *out++ = randomval;
    howmany--;
  }

I implemented this benchmark using 1000 random numbers per trial. I record the number of mispredicted branches per random number generated on different processors.

ARM 64-bit (Skylark):

trial basic loop unrolled twice unrolled four times
1 58% 57% 56%
2 48% 33% 26%
3 48% 28% 21%
15 45% 18% 8%

Intel x64 (Cannon Lake):

trial basic loop unrolled twice unrolled four times
1 53% 51% 49%
2 50% 35% 30%
3 46% 20% 17%
15 40% 0.5% 0.4%