Why are unrolled loops faster?

A common optimization in software is to “unroll loops”. It is best explained with an example. Suppose that you want to compute the scalar product between two arrays:

  sum = 0;
  for (i = 0; i < length; i++)
    sum += x[i] * y[i];

An unrolled loop might look as follows:

  sum = 0;
  i = 0;
  if (length > 3)
    for (; i < length - 3; i += 4)
      sum += x[i] * y[i] + x[i + 1] * y[i + 1] +
             x[i + 2] * y[i + 2] + x[i + 3] * y[i + 3];
  for (; i < length; i++)
    sum += x[i] * y[i];

Mathematically, both pieces of code are equivalent. However, the unrolled version is often faster. In fact, many compilers will happily (and silently) unroll loops for you (though not always).
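
You can even ask GCC to unroll a given loop explicitly. Here is a minimal sketch using the ‘#pragma GCC unroll’ directive (available in GCC 8 and later; other compilers offer their own mechanisms, and this function is merely my own illustration):

  #include <stddef.h>

  /* Ask GCC to unroll the loop by a factor of 4;
     the compiler remains free to ignore the request. */
  int scalar_product(const int *x, const int *y, size_t length) {
    int sum = 0;
  #pragma GCC unroll 4
    for (size_t i = 0; i < length; i++)
      sum += x[i] * y[i];
    return sum;
  }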

Unrolled loops are not always faster. They generate larger binaries. They require more instruction decoding. They use more memory and instruction cache. Many processors have optimizations specific to small, tight loops: manual loop unrolling that generates dozens of instructions within the loop tends to defeat these optimizations.

But why would unrolled loops be faster in the first place? One reason for their increased performance is that they lead to fewer instructions being executed.

Let us estimate the number of instructions that need to be executed with each iteration of the simple (rolled) loop. We need to load two values into registers. We need to execute a multiplication. And then we need to add the product to the sum. That is a total of four instructions. Unless you are cheating (e.g., by using SIMD instructions), you cannot do better than four instructions.

How many instructions do we measure per iteration of the loop? Using a state-of-the-art compiler (GNU GCC 8), I get 7 instructions. Where do these 3 extra instructions come from? We have a loop counter which needs to be incremented. Then this loop counter must be compared with the end-of-loop condition, and finally there is a branch instruction. These three instructions are “inexpensive”. There is probably some instruction fusion happening and other clever optimizations. Nevertheless, these instructions are not free.

Let us grab the numbers on an Intel (Skylake) processor:

amount of unrolling   instructions per pair   cycles per pair
1                     7                       1.6
2                     5.5                     1.6
4                     5                       1.3
8                     4.5                     1.4
16                    4.25                    1.6

My source code is available.

The number of instructions executed diminishes progressively (going toward 4) as the overhead of the loop becomes smaller and smaller due to unrolling. However, the speed, as measured in the number of cycles, does not keep decreasing: the sweet spot is an unrolling factor of about 4 or 8. In this instance, unrolling is mostly beneficial because of the reduced instruction overhead of the loop… but too much unrolling will eventually harm performance.

There are other potential benefits of loop unrolling in more complicated instances. For example, some loaded values can be carried between loop iterations, thus saving load instructions. If there are branches within the loop, unrolling may help or harm branch prediction. However, I find that the reduced number of instructions is often the main benefit.

Science and Technology links (April 6th 2019)

  1. In a randomized trial where people reduced their caloric intake by 15% for two years, it was found that reducing calories slowed aging. This is well documented in animals, going all the way to worms and insects, but we now have some evidence that it applies to human beings as well. Personally, I do not engage in either caloric restriction or fasting, but I am convinced it would be good for me to do so.
  2. What is the likely economic impact of climate change over the coming century? We do not know for sure. However, all estimates point to a modest impact, always significantly less than 10% of the size of the economy over a century while the world’s economy grows at about 3% a year.

    Clearly, 27 estimates are a thin basis for drawing definitive conclusions about the total welfare impacts of climate change. (…) it is unclear whether climate change will lead to a net welfare gain or loss. At the same time, however, despite the variety of methods used to estimate welfare impacts, researchers agree on the order of magnitude, with the welfare change caused by climate change being equivalent to the welfare change caused by an income change of a few percent. That is, these estimates suggest that a century of climate change is about as good/bad for welfare as a year of economic growth.

  3. Is the scientific establishment biased against women or not? Miller reports on new research showing that men tend to reject evidence of bias whereas women tend to reject contrary evidence.
  4. Technology greatly improved the productivity of farming. We are often told that the reason we did not see famines on a massive scale despite earlier predictions to that effect (e.g., by the Club of Rome) is the so-called Green Revolution. It seems that this is not well founded on facts:

    We argue a political myth of the Green Revolution focused on averted famine is not well grounded in evidence and thus has potential to mislead to the extent it guides thinking and action related to technological innovation. We recommend an alternative narrative: The Green Evolution, in which sustainable improvements in agricultural productivity did not necessarily avert a global famine, but nonetheless profoundly shaped the modern world.

  5. Sugar does not give your mood a boost. We do not feel more energetic after eating sugar.
  6. Though e-cigarettes are probably incomparably safer than actual cigarettes, people have been banning them on the grounds that e-cigarettes might be a gateway toward cigarettes. They are likely wrong. If anything, e-cigarettes are probably a solution for people who have not managed to stop smoking by other means. They have been found to be a highly effective way to stop smoking. Thus e-cigarettes are likely saving lives; people who ban e-cigarettes despite the evidence should have to answer for the consequences of their choices.
  7. People who think that little boys are more physically aggressive than little girls because of how they are raised are likely wrong.
  8. I am impressed with the courage of these researchers: Oral sex is associated with reduced incidence of recurrent miscarriage (Journal of Reproductive Immunology, 2019).

Science and Technology links (March 30th 2019)

  1. As we age, we accumulate old and useless (senescent) cells. These cells should die, but they do not. Palmer et al. removed senescent cells in obese mice. They found that these mice were less diabetic and just generally healthier. That is, it appears that many of the health problems due to obesity might have to do with the accumulation of senescent cells.
  2. Europe is changing its copyright laws to force websites to be legally responsible for the content that users upload. In my opinion, copyright laws tend to restrict innovation. I also think that Europe is generally not interested in innovating: where is Europe’s Google or Europe’s Samsung?
  3. China is cloning police dogs.
  4. Do we create new neurons throughout life, or not? It remains a controversial question, but a recent article in Nature seems to indicate that neurogenesis in adult human beings is tangible:

    By combining human brain samples obtained under tightly controlled conditions and state-of-the-art tissue processing methods, we identified thousands of immature neurons in (…) neurologically healthy human subjects up to the ninth decade of life. These neurons exhibited variable degrees of maturation (…) In sharp contrast, the number and maturation of these neurons progressively declined as Alzheimer’s Disease advanced.

  5. Generally speaking, the overall evidence is that fit and healthy people tend to be smarter. It is a myth unsupported by science that the gym rat is dumb whereas the pale out-of-shape guy is smart. If you want to be smart, you better stay fit and healthy. Evidently, this suggests that as you age, you may lose some of your intellectual sharpness. Cornelis et al. processed a large dataset of cognitive tests and they conclude that you are not losing your intelligence very much, at least until you reach a typical retirement age:

    declines in cognitive abilities between the end of the fourth decade and age 65 are small.

    In their experiments, fluid intelligence (basically our reasoning ability) did not change very much and sometimes increased over time. This apparently contradicts other studies based on smaller samples, and the authors discuss this apparent contradiction. Reaction time increased with age: older people are slower, everything else being equal.

Java is not a safe language

The prime directive in programming is to write correct code. Some programming languages make it easy to achieve this objective. We can qualify these languages as ‘safe’.

If you write in C++ without good tools, you are definitely in the ‘unsafe’ camp. The people working on the Rust programming language are trying to build a ‘safe language’.

Where does Java lie?

Back when Java was still emerging, I had been tasked with building a new image compression library. I designed a dual Java/C++ library. My client was a company providing medical services, but they had no use for the Java code. To this day, I think that they only use the C++ code.

When I tried to sell them a license to the Java code, I stressed that Java was safer, had automatic memory management and the like. Their top engineer looked at my Java code and spotted a potential memory leak. Yes, Java has memory leaks. You may have been told that it does not happen, but it happens all the time in real systems. We had a beer and a good laugh about it. Meanwhile, he had been able to prove that my C++ code was safe and did not have memory leaks.

In any case, most people would agree that Java is ‘safer’ than C++, but as my story illustrates, it is more of a statistical statement than a black-and-white one.

Is Java a safe language in 2019?

It is a time-dependent, culturally loaded question, but I do not think of Java as a safe language today. If ‘safety’ is your primary concern, then you have better options.

Let me review some examples:

  1. Java does not trap overflows. That is, if you are trying to count how many human beings there are on Earth using a Java ‘int’, incrementing the counter by one each time, the counter will overflow silently and give you a nonsensical result. Languages like Rust and Swift catch overflows. The Java standard library has some functions to guard against overflows (e.g., Math.addExact), but they are not part of the language. As a related issue, Java promotes and converts types silently and implicitly. Can you guess what the following code will print out?
    short x = Short.MAX_VALUE;
    short y = 2;
    System.out.println(x+y);
    int ix = Integer.MAX_VALUE;
    int iy = 2;
    System.out.println(ix+iy);
    

    The answer: the first prints 32769, because the short values are silently promoted to int before the addition; the second prints -2147483647, because the int addition silently overflows. This type of behaviour leads to hard-to-catch bugs.

  2. Java allows data races, that is, it is possible in Java to have several threads accessing the same object in memory at the same ‘time’ with one thread writing to the memory location. Languages like Rust do not allow data races. Almost anyone who has programmed non-trivial Java programs has caused or had to debug a data race. It is a real problem.
  3. Java lacks null safety. When a function receives an object, this object might be null. That is, if you see ‘String s’ in your code, you often have no way of knowing whether ‘s’ contains an actual String unless you check at runtime. Can you guess whether programmers always check? They do not, of course. In practice, mission-critical software does crash without warning due to null values. We have two decades of examples. In Swift or Kotlin, you have safe calls or optionals as part of the language. Starting with Java 8, you have Optional objects in the standard library, but they are an afterthought.
  4. Java lacks named arguments. Given a function that takes two integer values, you have to write ‘f(1,2)’. But should it be ‘f(2,1)’ instead? How do you know that you got the parameters in the right order? Getting confused about the argument order is a cause of hard-to-debug problems. Many modern programming languages have named arguments.

Ultimately, I believe that while some programming languages make it easier to produce correct code than others, much of it comes down to good engineering practices. I would never go as far as saying that programming languages do not matter, but I bet that ‘who’ writes the software is a lot more important.

Hasty comparison: Skylark (ARM) versus Skylake (Intel)

In a previous post, I ran a benchmark on an ARM server and again on an Intel-based server.  My purpose was to indicate that if one function is faster, even much faster, on one processor, you are not allowed to assume that it will also be faster on a vastly different processor. It wasn’t meant to be a deep statement, but even simple facts need illustration. Nevertheless, it was interpreted as an ARM versus Intel comparison.

In the initial numbers that I offered, the ARM Skylark processor that I am using did very poorly compared to the Intel Skylake processor. Eric Wallace explained away the result:  The default compiler on my Linux CentOS machine appears to be unaware of my processor architecture (ARM Aarch64) and, incredibly enough, compiles the code down to 32-bit ARM instructions.

So let us get serious and use a recent compiler (GNU GCC 8) from now on.

And while we are at it, let us do a Skylark versus Skylake, ARM versus Intel, benchmark. I am going to pick three existing C programs from the Computer Language Benchmark Games:

  1. Binarytree is a memory access benchmark. The code constructs binary trees and must traverse them.
  2. Mandelbrot is a number crunching benchmark.
  3. Fasta is a randomized string generation benchmark.

The Skylark processor is from a 32-core box; its reported maximum frequency is 3.3 GHz. The Skylake processor is from a 4-core box with a maximum frequency of 4 GHz. Here are the numbers I get.

             Skylark (ARM)   Skylake (Intel)
Binarytree   80 s            16 s
Mandelbrot   15 s            24 s
Fasta        2.0 s           0.8 s

My benchmark is available.

What can we conclude from these numbers? Nothing except maybe that the Skylark box struggles with Binarytree. That benchmark is dominated by the cost of memory allocation/deallocation.

Let me try another benchmark, this time from the cbitset library:

               Skylark (ARM)   Skylake (Intel)
create         23 ms           4.0 ms
bitset_count   3.2 ms          4.4 ms
iterate        5.0 ms          4.0 ms

The “create” benchmark is basically a memory-intensive test, whereas the two other tests are computational. Again, it seems that the ARM server struggles with memory allocations.

Is that something that has to do with the processor or the memory subsystem? Or is it a matter of compiler and standard libraries?

Update: Though the ARM box runs a relatively recent CentOS distribution, it comes with an older C library. Early testing seems to suggest that this software difference accounts for a sizeable fraction (though not all) of the performance gap between Skylake and Skylark.

Update 2: Using ‘jemalloc’, the ‘Binarytree’ timing goes from 80 s to 44 s while the ‘create’ timing goes from 23 ms to 13 ms. This gives me confidence that some of the performance gap reported above between Skylake and Skylark is due to software differences.

Technological aging

We are all familiar with biological aging. Roughly speaking, it is the loss of fitness that most animals undergo with time. At the present time, there is simply not much you can do against biological aging. You are just not going to win any gold medals in the Olympics at age 65.

However, not all “aging” in human beings is biological.

There is what I would call “chronological aging”: the trivial fact that, with each passing day, you have been alive one more day. While biological aging might be reversed one day, it is a logical certainty that no amount of technology, except maybe time travel, can reverse chronological aging. Interestingly enough, technology could affect (but not reverse) chronological aging: if you go for a trip at close to the speed of light, your chronological aging will slow compared to that of the people you leave behind.

However, much of “aging” is actually social. For example, my children make fun of me since I cannot skateboard. That is true: I never learned to skateboard. However, I am quite convinced that I could learn. I might even like it. Yet I am concerned about what people might think if I showed up at a skate park with my skateboard.

More interesting to me is “technological aging”. It is the idea that with chronological age, people tend to fail to adopt new technologies up until the point where it becomes too hard for them to catch up.

It goes like this:

  1. You are up-to-date technologically in your teens and twenties.
  2. Some new technology is developed when you are in your thirties or beyond. Maybe it’s ebooks, ecommerce or Facebook.
  3. At first, this new technology is not very good or it is simply reasonable to consider it with suspicion. So you give it a pass. You choose not to adopt it. In any case, you are doing well with the technology you know.
  4. More new technologies come along, some of them building on the technology you did not adopt. It becomes increasingly tempting to give them a pass. Not only are you missing some of the foundations, but the technologies are new and can be viewed with suspicion.
  5. Finally, after a few decades, you are disconnected technologically, incapable of keeping up.

Observe that technological aging is somewhat independent from biological aging. That is, imagine a society where we can rejuvenate anyone. So you reach the biological age of 30, and you are stuck there. You never have grey hair. Your skin remains youthful. You would still be able to tell how old someone is just by which technologies they choose to use.

Technological aging is not unique. There is a related concept which I call “cultural aging”. For example, we tend to prefer the music that came out when we were in our teens. I believe that the same effects are at play. New music or new styles come along, but you don’t embrace them because you already have your favorite music. Over time, you become increasingly disconnected.

In any case, the great thing about technological aging is that, unlike biological aging, I believe that it is largely reversible. You can adopt ebooks even if you are in your 60s. You can drop cable TV in favour of the Internet. You can stop defending the lecture as a mode of instruction and embrace YouTube and podcasts. However, it takes deliberate effort.

Science and Technology links (March 23rd 2019)

  1. Half of American households subscribe to “Amazon Prime”, a “club membership” for Amazon customers with monthly fees. And about half of these subscribers buy something from Amazon every week. If you are counting, this seems to imply that at least a quarter of all American households order something from Amazon every week.
  2. How do the preprints that researchers post online freely differ from genuine published articles that underwent peer review? Maybe less than you’d expect:

    our results show that quality of reporting in preprints in the life sciences is within a similar range as that of peer-reviewed articles

  3. Very low meat consumption might increase the long-term risk of dementia and Alzheimer’s.
  4. We appear to be no closer to finding a cure for Alzheimer’s despite billions being spent each year on research and clinical trials. Lowe writes:

    Something is wrong with the way we’re thinking about Alzheimer’s (…) It’s been wrong for a long time and that’s been clear for a long time. Do something else.

  5. Many researchers use “p values” (a statistical measure) to prove that their results are “significant”. Ioannidis argues that most research should not rely on p values.
  6. Eating nuts improves cognition (nuts make you smart).
  7. As we age, we become more prone to diabetes. According to an article in Nature, senescent cells in the immune system may lead to diabetes. Senescent cells are cells that should be dead due to damage or too many divisions, but they refuse to die.
  8. Hospitalizations for heart attacks have declined by 38% in the last 20 years and mortality is at an all-time low. Though clinicians and health professionals take the credit, I am not convinced we understand the source of this progress.
  9. In stories, females identify more strongly with their own gender whereas males identify equally with either gender.
  10. Theranos was a large company that pretended to be able to do better blood tests. The company was backed by several granted patents. Yet we know that Theranos technology did not work. The problem we are facing now is that Theranos patents, granted on false pretenses and vague claims, remain valid and will hurt genuine inventors in the future. If we are to have patents at all, they should only be granted for inventions that work. Nazer argues that the patent system is broken.
  11. Smaller groups tend to create more innovative work, and larger groups less so.
  12. The bones of older people become fragile. A leading cause of this problem is the fact that stem cells in our bones become less active. It appears that this is caused by excessive inflammation. We can induce the problem in young mice by exposing them to the blood serum of old mice. We can also reverse it in old mice by using an anti-inflammatory drug (akin to aspirin).
  13. Gene therapy helped mice regain sight lost due to retinal degeneration. It could work in human beings too.
  14. Based on ecological models, scientists predicted over ten years ago that polar bear populations would soon collapse. That has not happened: there may be several times more polar bears than decades ago. It is true that ice coverage is lower than it has been historically due to climate change, but it is apparently incorrect to assume that polar bears need thick ice; they may in fact thrive when the ice is thin and the summers are long. Crockford, a zoologist and professor at the University of Victoria, tells the tale in her book The Polar Bear Catastrophe That Never Happened.

ARM and Intel have different performance characteristics: a case study in random number generation

In my previous post, I reviewed a new fast random number generator called wyhash. I commented that I expected it to do well on x64 processors (Intel and AMD), but not so well on ARM processors.

Let us review again wyhash:

uint64_t wyhash64_x;

uint64_t wyhash64() {
  wyhash64_x += 0x60bee2bee120fc15;
  __uint128_t tmp;
  tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
  uint64_t m1 = (tmp >> 64) ^ tmp;
  tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;
  uint64_t m2 = (tmp >> 64) ^ tmp;
  return m2;
}

(Source code)

It is only two multiplications (plus a few cheap operations like add and XOR), but these are full multiplications producing a 128-bit output.

Let us compare it with a similar but conventional generator (splitmix) developed by Steele et al. and part of the Java library:

uint64_t splitmix64_x;

uint64_t splitmix64(void) {
  splitmix64_x += 0x9E3779B97F4A7C15;
  uint64_t z = splitmix64_x;
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EB;
  return z ^ (z >> 31);
}

We still have two multiplications, but many more operations. So you would expect splitmix to be slower. And it is, on my typical x64 processor.

Let me reuse my benchmark where I simply sum up 524288 random integers and record how long it takes…

           Skylake (x64)   Skylark (ARM)
wyhash     0.5 ms          1.4 ms
splitmix   0.6 ms          0.9 ms

According to my tests, on the x64 processor, wyhash is faster than splitmix. When I switch to my ARM server, wyhash becomes slower.

The difference is that the computation of the most significant bits of a 64-bit product on an ARM processor requires a separate and potentially expensive instruction.
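
To see this concretely, consider the following function, which extracts the most significant 64 bits of a full product (a sketch; ‘__uint128_t’ is a GCC/Clang extension, as in the generators above). On x64, a single ‘mul’ instruction yields both halves of the product; on 64-bit ARM, the compiler emits a ‘mul’ for the low half and a separate ‘umulh’ for the high half:

#include <stdint.h>

// Most significant 64 bits of the 128-bit product of a and b.
// x64: one mul instruction produces both halves.
// 64-bit ARM: mul gives the low half, umulh the high half.
uint64_t mulhi64(uint64_t a, uint64_t b) {
  return (uint64_t)(((__uint128_t)a * b) >> 64);
}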

Of course, your results will vary depending on your exact processor and exact compiler.

Note: I have about half a million integers, so if you double my numbers, you will get a rough estimate of the number of nanoseconds per 64-bit integer generated.

Update 1: W. Dijkstra correctly pointed out that wyhash could not possibly be several times faster than splitmix in a fair competition. I initially reported bad results with splitmix, but after disabling autovectorization (-fno-tree-vectorize), the results are closer. He also points out that results are vastly different on other ARM processors like Falkor and ThunderX2.

Update 2: One reading of this blog post is that I am pretending to compare Intel vs. ARM and to qualify one as being better than the other one. That was never my intention. My main message is that the underlying hardware matters a great deal when trying to determine which code is fastest.

Update 3. My initial results made the ARM processor look bad. Switching to a more recent compiler (GNU GCC 8.3) resolved the issue.

The fastest conventional random number generator that can pass Big Crush?

In software, we sometimes want to generate (pseudo-)random numbers. The general strategy is to have a state (e.g., a 64-bit integer) and modify it each time we want a new random number. From this state, we can derive a “random number”.

How do you know that you have generated something that can pass as a random number? A gold standard in this game is L’Ecuyer’s Big Crush benchmark. It is a set of statistical tests. It is not sufficient to “pass” Big Crush to be a good random number generator, but if you can’t even pass Big Crush, you are in trouble.

When I need a super fast and super simple random number that qualifies, I go for Lehmer’s generator:

__uint128_t g_lehmer64_state;
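// (the state must be seeded to an odd value; a zero state would remain zero)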

uint64_t lehmer64() {
  g_lehmer64_state *= 0xda942042e4dd58b5;
  return g_lehmer64_state >> 64;
}

(Source code)

Once compiled for an x64 processor, the generator boils down to two 64-bit multiplication instructions and one addition. It is hard to beat! The catch is that there is a non-trivial data dependency between successive calls, since they share the same state: you may need to complete the two multiplications before you can start work on the next function call. Because our processors are superscalar (meaning that they can execute several instructions per cycle), this is a genuine concern. You can break this data dependency by having effectively two generators, using one and then the other, as in the sketch below.
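
A minimal sketch of this two-generator idea; the seed values are arbitrary placeholders of my own choosing (my actual benchmark code may differ):

#include <stdint.h>

// Two independent Lehmer states; the seeds are arbitrary odd
// placeholders, not values taken from my benchmark.
__uint128_t g_lehmer64_state1 = 0x9E3779B97F4A7C15;
__uint128_t g_lehmer64_state2 = 0xDA942042E4DD58B5;

// Produce two random numbers per call: the two multiplication
// chains are independent, so a superscalar processor can
// execute them in parallel.
void lehmer64_pair(uint64_t *out1, uint64_t *out2) {
  g_lehmer64_state1 *= 0xda942042e4dd58b5;
  g_lehmer64_state2 *= 0xda942042e4dd58b5;
  *out1 = (uint64_t)(g_lehmer64_state1 >> 64);
  *out2 = (uint64_t)(g_lehmer64_state2 >> 64);
}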

Lehmer’s generator passes Big Crush. There are many other fast generators that pass basic statistical tests, like PCG64, xorshift128+, but if you want raw speed, Lehmer’s generator is great.

Recently, a new fast contender has been brought to my attention: wyhash. It is closely related to a family of random number generators and hash functions called MUM and designed by Vladimir Makarov (there is a nearly identical generator by Makarov called mum-prng). The new contender works as follows:

uint64_t wyhash64_x;

uint64_t wyhash64() {
  wyhash64_x += 0x60bee2bee120fc15;
  __uint128_t tmp;
  tmp = (__uint128_t) wyhash64_x * 0xa3b195354a39b70d;
  uint64_t m1 = (tmp >> 64) ^ tmp;
  tmp = (__uint128_t)m1 * 0x1b03738712fad5c9;
  uint64_t m2 = (tmp >> 64) ^ tmp;
  return m2;
}

(Source code)

Wyhash (and presumably mum-prng) passes rigorous statistical tests.

On an x64 processor, the function requires two multiplications, one addition and two XORs. If you are counting, that’s only two instructions more than Lehmer’s generator. Like generators from the PCG family, wyhash updates its seed very simply, and so you can pipeline the generation of two or more random numbers with minimal data dependency between them: as soon as one addition is completed, you can start work on the second number.

Both of these generators might be relatively less performant on ARM processors due to the high cost of generating the full 128-bit product on ARM architectures. They are also both relatively harder to implement in a portable way.

This being said, which is faster on my x64 processor?

Let us run the experiments. I am going to work over sets of 524288 random numbers. I am using a Skylake processor and GNU GCC 8. I make my source code available.

First, I just sum up the random numbers being generated.
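
Concretely, the summing test is essentially the loop below (a sketch; my actual benchmark, linked above, also takes care of the timing and of repeating the measurements):

#include <stdint.h>
#include <stddef.h>

uint64_t wyhash64(void); // the generator shown above

// Sum 524288 random numbers; returning the sum prevents the
// compiler from discarding the generator calls.
uint64_t sum_random(void) {
  uint64_t sum = 0;
  for (size_t i = 0; i < 524288; i++)
    sum += wyhash64();
  return sum;
}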

wyhash 0.51 ms
Lehmer’s 0.63 ms
Lehmer’s (two gen.) 0.48 ms
Lehmer’s (three gen.) 0.37 ms

From run to run, my margin of error is about 0.02 ms.

Next I am going to store the random numbers in an array.

wyhash 0.6 ms
Lehmer’s 0.6 ms
Lehmer’s (two gen.) 0.6 ms
Lehmer’s (three gen.) 0.4 ms

So using three Lehmer generators is best.

Of course, using parallel generators in this manner could be statistically unsafe. One would want to run further tests.

Credit: Monakov suggested going to three generators. The post was updated accordingly.

Further Reading: There were apparently some Hacker News comments on both a new hash function (XXH3) and wyhash.

Credit: Wyhash was invented by Wang Yi.

Update: I have implemented wyhash in Swift.

Don’t read your data from a straw

It is common for binary data to be serialized to bytes. Data structures like indexes can be saved to disk or transmitted over the network in such a form. Many serialized data structures can be viewed as sets of ‘integer’ values. That is the case, for example, of a Roaring bitmap. We must then read back this data. An integer could be serialized as four bytes in a stream of bytes.

We are thus interested in the problem where we want to convert an array of bytes into an array of integer values.

In C or C++, you can safely convert from a stream of bytes to in-memory values using a function like ‘memcpy’. It is fast. Or you can just cheat and do a “cast”.
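
For example, a minimal sketch in C (assuming the number of bytes is a multiple of four and that the serialized byte order matches the machine’s):

#include <stdint.h>
#include <string.h>

// Convert a byte array to 32-bit integers with a single memcpy;
// 'length' is the number of bytes and must be a multiple of 4.
void bytes_to_ints(const uint8_t *bytes, size_t length,
                   uint32_t *recipient) {
  memcpy(recipient, bytes, length);
}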

What do you do in Java?

A convenient approach is to wrap the byte array into an InputStream and then wrap that InputStream into a DataInputStream, like in this example where we convert an array of bytes into an array of integers:

byte[] array = ...
int[] recipient = new int[array.length / 4];
DataInput di = new DataInputStream(new ByteArrayInputStream(array));
for (int k = 0; k < recipient.length; k++)
    recipient[k] = di.readInt();

The benefit of this approach is improved abstraction: you do not care whether the data comes from an array of bytes or from disk; it is all the same code. If you have serialization and deserialization code, it is probably written in terms of OutputStream and InputStream anyhow, so why not reuse that perfectly good code?

However, Java offers a performance-oriented concept called a ByteBuffer to represent an array of bytes. It is not as high level as an input stream since it assumes that you do have, somewhere, an array of bytes.

You can achieve the same conversion as before using a ByteBuffer instead:

byte[] array = ...
int[] recipient = new int[array.length / 4];
ByteBuffer bb = ByteBuffer.wrap(array);
bb.asIntBuffer().get(recipient);

Here is the time required to convert 1 million 32-bit integers on my 4.2GHz 2018 iMac:

DataInputStream 10 ms
ByteBuffer 1 ms

That is, the ByteBuffer is 10x faster. My code is available.

Because I have 1 million integers, we can convert these timings into “time per integer”: the ByteBuffer approach achieves a speed of one 32-bit integer converted per nanosecond. Given that my iMac can probably execute something like a dozen operations per nanosecond, that’s not impressive… but it is at least somewhat respectable. The DataInputStream takes 10 nanoseconds (something like 40 or 50 cycles) per integer: it is grossly inefficient.

This has interesting practical consequences. In the RoaringBitmap project, Alvarado pointed out that it is faster to create a bitmap using a ByteBuffer backend, and then convert it back into a normal data structure, than to construct it directly from an input stream. And the difference is not small.

Practically, this means that it may be worth it to provide a special function that can construct a bitmap directly from a ByteBuffer or a byte array, bypassing the stream approach. (Thankfully, we have bitmaps backed with ByteBuffer to support applications such as memory-file mapping.)

Speaking for myself, I was under the impression that Java would do a fine job reading from a byte array through an input stream. At the very least, we found that ByteArrayInputStream is not the right tool. That a ByteBuffer would be fast is not so surprising: as far as I can tell, ByteBuffer was introduced in the standard API precisely for performance reasons. However, a factor of ten is quite a bit more than I expected.

In any case, it is a perfectly good example of a problem whereby abstractions force you to consume data as if it went through a straw. Streams and iterators are handy abstractions, but they often lead you astray with respect to performance.

Further reading. Ondrej Kokes has reproduced these results in Go.