Iterating in batches over data structures can be much faster…

We often need to iterate over the content of data structures. It is surprisingly often a performance bottleneck in big-data applications. Most iteration code works one value at a time…

for value in datastructure {
  do something with value
}

Each iteration involves a request to the data structure for a new value. Alternatively, we can query the data structure far less often by asking it to fill a buffer…

for blockofvalues in datastructure {
  for value in blockofvalues {
      do something with value
  }
}

It is not automatically faster: you have to store values to a buffer and then read them again. It involves copying data from registers to memory and back. There is some inherent latency and it is an extra step.

However, if you make your buffer large enough but not too large (e.g., 1kB), the latency will not matter much and you will remain in CPU cache (fast memory). Thus you should, in the worst case, be only slightly slower. What do I mean by “slightly”? Basically, you are adding the equivalent of a memory copy over a small buffer.

When accessing data over a network, or even across processes on the same machine, it is worth it to process the data in batches because the cost of each transaction is high. When working with data structures inside your own process, the transaction cost might be low. Repeated function calls in a loop are cheap, and they can become free after inlining. To my knowledge, batched iteration is not typically available in standard libraries.
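To make this concrete, here is a minimal sketch, in Java, of what a batched iterator might look like. The interface and the names (BatchIterator, nextBatch) are hypothetical; I am not describing any existing standard-library API.

// Hypothetical interface: fill a caller-provided buffer and report how
// many values were written; zero means that the iteration is over.
interface BatchIterator {
  int nextBatch(int[] buffer);
}

long sum(BatchIterator it) {
  int[] buffer = new int[256]; // small enough to stay in CPU cache
  long total = 0;
  int howmany;
  while ((howmany = it.nextBatch(buffer)) > 0) {
    for (int i = 0; i < howmany; i++) {
      total += buffer[i]; // do something with buffer[i]
    }
  }
  return total;
}

The caller owns the buffer, so a small, cache-friendly buffer can be reused across calls.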

Thus, until recently, I did not pay much attention to the idea of iterating in batches over data structures. I could imagine some gains, but I expected them to be small.

In the Go implementation of Roaring bitmaps, Ben Shaw contributed a way to iterate over values in batches, recovering many values in a buffer with each function call. It helped the performance considerably (almost doubling the speed on some tests). Richard Startin then did the same in the Java implementation. It also helped a lot:

The batch iterator is always at least twice as fast as the standard iterator (…) Depending on the contents and size of the bitmap, the batch iterator can be 10x faster.

So I started to wonder… is this an underrated strategy?

I modified the popular Go bitset library and, in some iteration tests, the batched iteration was nearly twice as fast!

The batched code is more complex, but not so terrible:

buffer := make([]uint, 256)
j := uint(0)
j, buffer = bitmap.NextSetMany(j, buffer)
for ; len(buffer) > 0; j, buffer = bitmap.NextSetMany(j, buffer) {
    for k := range buffer {
        // do something with buffer[k]
    }
    // NextSetMany returns the index of the last set bit it found;
    // advance past it before asking for the next batch
    j += 1
}

Then I modified the cbitset library. Again, I saw almost a doubling of the speed. Once more, the code is a bit more complicated:

size_t buffer[256];
size_t howmany = 0;
for (size_t startfrom = 0;
     (howmany = nextSetBits(b1, buffer, 256, &startfrom)) > 0;
     startfrom++) {
  for (size_t i = 0; i < howmany; i++) {
    // do something with buffer[i];
  }
}

These good results depend on what kind of data you iterate over, how you use the data, and what kind of data structure you have. Obviously, it is useless to batch iterations over an array of values. Yet my few tests provide enough evidence to conclude that batch iteration is worth investigating when speed is a concern.

On Twitter, Milosz Tanski explained the result as follows:

One thing to remember about CPU and optimization in general is that almost all hardware is designed to operate at maximum speed when it’s doing similar work on similar data. Branch prediction, prefetch, caches, op code level parallelization all make this assumption.

Introducing GapminderVR: Data Visualization in Virtual Reality

I am a big fan of sites such as Gapminder and Our World in Data. Such data visualization sites are like intellectual pornography. You want to know which countries are doing better? Which continents drink more alcohol? How is alcohol related to GDP? Have people been getting fatter recently, or is that a long-term trend? You don’t need to read thick books, you can just browse graphs.

I’m also a big fan of virtual reality (VR). In February 2016, I placed a bet against Greg Linden to the effect that by 2019, we would sell at least 10 million VR units a year worldwide. I might very well lose my bet but what is surely correct is that VR is part of our future. It is going to be everywhere, soon… where “soon” remains to be defined, but it is not 30 years.

What if you could mix data visualization and VR? What would happen? Could you make new things that nobody could think of before? I think so. That’s another one of my bets… but unlike my bet with Greg Linden, it is a bet I took by investing time and money in my lab.

To be fair, the idea that you could use data visualization in VR has been around for at least 20 years. It has gone exactly nowhere.

Why would it be different now?

The most obvious factor is cost. VR is much cheaper than it ever was, and it is getting cheaper by the day. And it is not just the hardware. The software is getting better and cheaper. This means that many of us can try new things and iterate much faster than ever before. If we can gather enough heads together, we will get somewhere.

If the work remains limited to a few academics, it is never going to take off. We need engineers, designers and programmers to jump in.

In my lab, we decided to spend a few months building prototypes of what is possible in VR. We are publishing as demos two interesting cases:

  • Rail of time is a demo where you can control the time dimension with your feet (by walking). Walk forward and the time goes forward. Walk backward and the time goes backward. (YouTube)
  • Museum is a demo where you can visit a museum where the statues represent countries at a given time. The various attributes of the statues represent the various dimensions of the data. (YouTube)

If you have a VR headset, you can try our demos on our site: GapminderVR.org. The name of the site is meant to pay homage to Gapminder and to the work of Hans Rosling. All our code is public and you can “steal” it. Or get in touch and we will help you. We hope to inspire future work. If you are interested in helping out, get in touch. If you can do better, please let us know about your work.

The design and programming work was done by Niko Girardelli. He is a super brilliant engineering student, and someone ought to offer him a six-figure job. Yet, to be fair, the programming is less demanding than you might expect. It is all JavaScript in the browser. And yes, the performance is decent. We owe a lot of credit to Mozilla and their work on WebVR. It is amazing.

Science and Technology links (April 13th, 2018)

  1. Somewhat depressingly, there is very little evidence that you can improve people’s overall cognitive abilities:

    Although cognitive ability correlates with domain-specific skills—for example, smarter people are more likely to be stronger chess players and better musicians—there is little evidence that chess or music instruction makes people smarter. Rather, smarter individuals are more likely to engage and excel in these fields.

    That is, if you receive musical training, you may get better at playing an instrument. However, this will not make you better at programming in Java.

    Time spent in school does not significantly improve students’ cognitive skills. However, smarter people spend more time in school.

    This failure to increase overall cognitive skills destroys one of the main premises of schooling. It also means that if you have kids, you should not push them into classes in the hope of making them smarter.

    The same works at the scale of countries. Richer countries educate their people more… but it is not the added education that makes them richer.

    Of course, you can make yourself smarter. Just buy a computer.

  2. A woman’s voice changes after pregnancy. It sounds more masculine, for a time at least.
  3. Women (>45 years old) who consume high-fat dairy products gain less weight over time.
  4. The youngest children in a school class are more likely than their classmates to receive pharmacological treatment for attention deficit/hyperactivity disorder (ADHD). This result suggests that it may be difficult to differentiate immaturity from ADHD.
  5. There are contradictory reports as to whether human beings create new neurons throughout their life. The LA Times has a nice article on neurogenesis: Scientists find signs of new brain cells in adults as old as 79.
  6. In small monkeys, eating less is strongly associated with a longer and healthier life according to a study published in Nature:

    Compared to control animals, caloric restriction extended lifespan by 50% (from 6.4 to 9.6 years, median survival), reduced aging-associated diseases and preserved loss of brain white matter in several brain regions.

    Importantly, this means that aging can be manipulated far more easily than might have been thought. That caloric restriction prolongs life and health has deep consequences. It means that your body does not age because it lacks energy or entropy. It also means that your body does not try to maximize your healthspan and lifespan.

    It is unknown whether caloric restriction has the same effect on human beings. For the record, I do not practice caloric restriction. I eat all the time. I am, however, quite thin and small.

  7. Unsurprisingly, late-life depression is associated with an increased risk for all-cause dementia, vascular dementia and Alzheimer’s disease. This is yet more evidence that we need to take depression seriously. It is not normal to be depressed beyond a certain point.
  8. Kell and Pretorius have published a speculative article where they suggest that many diseases of old age (Alzheimer’s, Parkinson’s, atherosclerosis, arthritis) could be due to dormant microbes that “wake up” later in life, in part due to iron dysregulation. This is not a new theory, but this paper is well written and under open access.

    If you like this theory, you might like Mangan’s book: Dumping Iron. He believes that many of us have too much iron in our bodies. There are two main ways to reduce your iron store: chelation therapy (using drugs) or bloodletting (for example, through blood donation).

    It seems probable that iron dysregulation is real. Sadly, there is weak evidence that you can prevent it. The risk of Parkinson’s disease was higher among men who reported recent multiple blood donations, and total iron intake was not associated with an increased risk of Parkinson’s disease.

    The trouble is that iron can take many forms and can go to different places in your body.

For greater speed, try batching your out-of-cache data accesses

In software, we use hash tables to implement sets and maps. A hash table works by first mapping a key to a random-looking address in an array.

In a recent series of blog posts (1, 2, 3), I have documented the fact that precomputing the hash values often accelerates hash tables.

Some people thought that I was merely making the trivial point that precomputing the hash values saved you the time to compute the hash values. That is true, but there is more to it.

On recent Intel processors, batching your load requests can be very helpful. Let me illustrate with some code.

I am going to use a simple hash function that takes integer values and returns “mixed up” values:

uint32_t murmur32(uint32_t h) {
  h ^= h >> 16;
  h *= UINT32_C(0x85ebca6b);
  h ^= h >> 13;
  h *= UINT32_C(0xc2b2ae35);
  h ^= h >> 16;
  return h;
}

This function is not very expensive, but it is efficient at generating random-looking outputs.

In a complete loop, it takes between 7 and 8 cycles per value to compute and store these hash values (modulo the array size) on a recent Intel processor.

Let us put this function to good use by picking up values at random locations in a large array and summing them up:

uint64_t sumrandom(uint64_t *values, size_t size) {
  uint64_t sum = 0;
  for (size_t k = 0; k < size; ++k) {
    sum += values[murmur32(k) % size ];
  }
  return sum;
}

You expect that the bulk of the time needed to execute this code would have to do with data accesses (given a large array). And indeed, for arrays exceeding my cache, it takes about 46 cycles per value to compute the sum.

So it means that about 40 cycles are due to the random look-up of the values.

Is that right?

Let us do something more complicated, where we first compute the hash values and then we do the look-ups…

uint64_t sumrandomandindexes(uint64_t *values, uint32_t *indexes, size_t size) {
  uint64_t sum = 0;
  for (size_t k = 0; k < size; ++k) {
    indexes[k] = murmur32(k) % size ;
  }
  for (size_t k = 0; k < size; ++k) {
    sum += values[indexes[k]];
  }
  return sum;
}

This looks more expensive, but it is not. It runs in about 32 cycles per operation. That is, separating the task into two passes, and doing more stores and loads overall, is significantly cheaper. It seems like it should not make sense, but the result is robust. (Note that simply unrolling the loop a few times might serve us well; I used an extreme example to make my point.)

operation                        cost
hash function                    ~8 cycles per value
sum over random values           ~46 cycles
hash function followed by sum    ~32 cycles

My code is available. I used GNU GCC compiler 6.0 as it gives the best results on this test.

So what is happening? When I use the Linux perf command to count the number of cache misses (perf stat -B -e cache-misses), I find that the approach that computes the hash values separately from the data loads has about 50% fewer cache misses.

Thus Intel processors have an easier time avoiding cache misses when the data loads are batched.
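There is a middle ground between the two extremes: compute a small block of indexes, then do the corresponding loads, so that the index buffer stays in cache. My benchmark code is in C, but here is a rough, untested Java sketch of the blocking idea (murmur32 is just a Java transliteration of the C function above):

// Untested sketch: batch the loads in blocks of 256 indexes so that the
// index buffer stays in L1 cache, without materializing all the indexes.
long sumRandomBatched(long[] values) {
  int size = values.length;
  int[] indexes = new int[256];
  long sum = 0;
  for (int k = 0; k < size; ) {
    int howmany = Math.min(256, size - k);
    for (int i = 0; i < howmany; i++) {
      // unsigned 32-bit hash value reduced modulo the array size
      indexes[i] = (int) (Integer.toUnsignedLong(murmur32(k + i)) % size);
    }
    for (int i = 0; i < howmany; i++) {
      sum += values[indexes[i]];
    }
    k += howmany;
  }
  return sum;
}

int murmur32(int h) {
  h ^= h >>> 16;
  h *= 0x85ebca6b;
  h ^= h >>> 13;
  h *= 0xc2b2ae35;
  h ^= h >>> 16;
  return h;
}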

I have not verified how other processors fare. Do you know?

Science and Technology links (April 7th, 2018)

  1. Mammals have a neocortex, some kind of upper layer on top of our ancestral brain. It is believed to be the key evolutionary trick that makes mammals smart. Yet birds have no neocortex, and some of them (parrots and crows) are just as smart as monkeys. Thus Güntürkün and Bugnyar conclude that

    a specific cortical architecture cannot be a requirement for advanced cognitive skills. During the long parallel evolution of mammals and birds, several neural mechanisms for cognition and complex behaviors may have converged despite an overall forebrain organization that is otherwise vastly different.

    That is, the specifics of how mammal brains work are maybe less important than it might appear in our quest to understand intelligence.

  2. Should you go see your doctor each year, just in case? Maybe not:

    Regardless of which screenings and tests were administered, studies of annual health exams dating from 1963 to 1999 show that the annual physicals did not reduce mortality overall or for specific causes of death from cancer or heart disease.

  3. Aspirin is associated with reduced incidence of death from cancers of the colon, lung and prostate. But it increases your risk of bleeding to death.
  4. New neurons can integrate in the adult neocortex, which suggests that progressive cell replacement in the brain is possible.
  5. Up until recently, most laptops had general processors with at most 4 cores. Intel is bringing 6-core processors to some new high-end laptops.
  6. Human beings generate new neurons throughout their lifetime (that’s called neurogenesis). It is believed to be important in the process of generating new memories. A new study reports that healthy older people without cognitive impairment generate new neurons as much as younger individuals. In fancy terms, neurogenesis does not decline with aging.
  7. You may move faster when reacting out of instinct than when you are acting deliberately.
  8. Apple recruited John Giannandrea from Google. Giannandrea was responsible for “artificial intelligence” at Google, among other things. Giannandrea looks like a hands-on engineering type. He has repeatedly stated that computers are dumb. He believes that computers should augment human beings, not replace them. It is believed that Giannandrea might help Apple with Siri since it is felt that Siri has stagnated. Giannandrea has a degree in computer science from the University of Strathclyde.
  9. Childless unmarried young women in New York City, Los Angeles and San Diego make 17%, 12% and 15% more than their male peers.

Caching hash values for speed (Swift-language edition)

In my posts Should you cache hash values even for trivial classes? and When accessing hash tables, how much time is spent computing the hash functions?, I showed that caching hash values could accelerate operations over hash tables, sets and maps… even when the hash tables do not fit in CPU cache.

To be clear, I do not mean to say that it is necessarily the computation of the hash values that is the bottleneck; however, the whole computation, including the latency of the operations, slows you down more than you would think, even when dealing with out-of-cache data structures.

In my earlier posts, I used the Java programming language. Java already relies on precomputed hash values in its standard library. Indeed, Java strings have precomputed hash values.

It is always prudent to check observations using different programming languages. So I decided to reproduce a similar test using the Swift programming language.

I create a trivial class containing three integer values just like I did in Java:

class Triple :  Hashable, Equatable {
      let x, y, z : Int
      init(x: Int, y:Int, z:Int) {
        self.x = x
        self.y = y
        self.z = z
      }
      final var hashValue: Int {
          return x + 31 &* y + 961 &* z
      }
      // Equatable conformance (required for Hashable)
      static func == (lhs: Triple, rhs: Triple) -> Bool {
          return lhs.x == rhs.x && lhs.y == rhs.y && lhs.z == rhs.z
      }
}

Then I recreate a slightly different version where the hash value is precomputed:

class BufferedTriple :  Hashable, Equatable {
      let x, y, z : Int
      private let hash : Int
      init(x: Int, y:Int, z:Int) {
        self.x = x
        self.y = y
        self.z = z
        hash = x + 31 &* y + 961 &* z
      }
      final var hashValue: Int {
          return hash
      }
      // Equatable conformance (required for Hashable)
      static func == (lhs: BufferedTriple, rhs: BufferedTriple) -> Bool {
          return lhs.x == rhs.x && lhs.y == rhs.y && lhs.z == rhs.z
      }
}

My benchmark involves creating a large set of these points, and checking how many of 100 other points are in this set. The code is simple:

// bs is the large set; a is an array holding the 100 query points
for i in a {
    if bs.contains(i) {
        counter += 1
    }
}

I ran this benchmark using Linux, Swift 4.1 and a Skylake processor. Roughly speaking, in my tests, the buffered version (that precomputes the hash values) is about twice as fast:

Triple    BufferedTriple
20 us     12 us

But maybe it takes much longer to create BufferedTriple instances? In fact, no. It takes about 1.5 us to construct the 100 instances, whether they are Triple or BufferedTriple. It takes slightly more time to construct the BufferedTriple instances, but the difference is less than 0.2 us.

My code is available.

My point is not that you should always or often precompute hash values. There are obvious downsides to this memoization approach, even if they did not stop the Java engineers from applying it to all strings. Consider this instead: if you think that computational speed does not matter when working with large, out-of-cache data structures, your mental model of software performance is incomplete.

Further reading: For greater speed, try batching your out-of-cache data accesses

Science and Technology links (March 30th, 2018)

  1. People who score higher on intelligence tests tend to have larger brains. Twin studies suggest the same genetic factors influence both brain size and intelligence.
  2. The effect of campaign contact and advertising on Americans’ candidate choices in general elections is zero. A related fact: in 2016, Clinton outspent Trump throughout the election by more than $150 million.
  3. In his article “In defense of skepticism about deep learning” (January 2018), Gary Marcus wrote “deep learning by itself, although useful, was unlikely to lead on its own to artificial general intelligence”.
  4. “We want to synthesize modified versions of all the genes in the human genome in the next few years” (George Church).
  5. Chinese students get 10% of all doctorates awarded in the U.S. There are more Chinese engineers working on AI at U.S. tech companies than in all of China.
  6. Women score higher in anxiety, vulnerability, altruism, and sympathy while men score higher in excitement-seeking and openness to intellect.
  7. Alcohol consumption is positively associated with handgrip strength in the elderly. Handgrip strength is strongly correlated with life expectancy.
  8. While postdocs are necessary for entry into tenure-track jobs, they do not enhance salaries in other job sectors over time.
  9. Studies in psychology do not hold up to scrutiny:

    We conducted replications of 100 experimental and correlational studies published in three psychology journals using high­-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline.

    While 97% of the original studies had significant results, only 36% of the replications had significant results. That is, given a random published study, if you bet that it is bogus, you will win most of the time.

    It also seems not to matter who is doing the research:

    Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise).

  10. You will do better if you believe that performing cognitive work energizes you to do more.
  11. It would have been better to invest all your money in gold back in 2000 than in the stock market. The same is true regarding Bitcoin from its starting point. I don’t know what it means.
  12. Almost all Economics Nobel prizes are closely related through their PhD supervisors. This is yet more evidence that “who you know” is really important to your intellectual productivity.
  13. Canada celebrates its 150th birthday by creating several prestigious research chairs (via Antonio Badia).
  14. Our ancestors had sex with the Neanderthals and the Denisovans.
  15. Obesity is a recent phenomenon:

    Between the birth cohorts of 1930 and 1993, the prevalence of obesity rose from 0 to 14% among boys and from 2 to 12% among girls. (…) Among boys, all these increases began after birth year 1970. Among girls, obesity began to rise after birth year 1980 (…).

    and

    While it may be true that body mass indexes (BMIs) have been increasing slowly for a long time, the increases observed in recent decades are much faster and have pushed many adults and children over the obesity threshold in a remarkably short time. The trend is distressing, but to reverse it we only need to turn the clock back to 1980. We don’t need to go back to 1900.

  16. Rats did not spread the plague:

    While it is commonly assumed that rats and their fleas spread plague during the Second Pandemic, there is little historical and archaeological support for such a claim. Here, we show that human ectoparasites, like body lice and human fleas, might be more likely than rats to have caused the rapidly developing epidemics in pre-Industrial Europe.

  17. Nonsteroidal anti-inflammatory drugs may protect against Alzheimer’s disease. In other words, it could be that aspirin keeps you sharp.

Should you cache hash values even for trivial classes?

Hash tables are a fundamental data structure in computing, used to implement maps and sets. In software, we use hash values to determine where objects are located within hash tables.

In my previous blog post, I showed that the bulk of the running time when checking values in a hash table could be taken up by the computation of the hash values themselves. I got some pushback because I implemented an array of three integers as a Java ArrayList<Integer>. This seems fair to me but there are obviously more efficient ways to represent three integers in Java.

So I decided to do as suggested, and use a trivial class:

class Triple {
  int x;
  int y;
  int z;

  public Triple(int mx, int my, int mz) {
    x = mx;
    y = my;
    z = mz;
  }

  public int hashCode() {
    return x + 31 * y + 31 * 31 * z;
  }
}

I also implemented a version with precomputed hash values:

class BufferedTriple {
  int x;
  int y;
  int z;
  int hashcode;

  public BufferedTriple(int mx, int my, int mz) {
    x = mx;
    y = my;
    z = mz;
    hashcode =  x + 31 * y + 31 * 31 * z;
  }

  public int hashCode() {
    return hashcode;
  }
}

Given these two classes, I create large hash tables (10 million entries) and I check 100 different target keys.

So is the class with the cached hash value faster? Yes, it is about 25% faster:

Triple    BufferedTriple
0.4 us    0.3 us

My code is available.

I should add that Java’s hash tables re-hash the hash values that we provide, so the benefits of the cached hash value are less than they could be.
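For reference, OpenJDK 8’s HashMap (which backs HashSet) spreads the hash value we provide essentially as follows before it computes the bucket index:

// Essentially what OpenJDK 8 does: XOR the hashCode with its upper 16 bits
// so that high-order bits influence the bucket selection.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}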

Moreover, Strings have cached hash values by default in Java. I’m definitely not the first to notice that caching hash values can be valuable.

I should add that this was not the point that I wanted to make in my original blog post. I do not particularly care whether you cache your hash values. The point I wanted to make is that even something as cheap as the computation of a hash value can be a limiting factor, even when you have lots of data in RAM.

Further reading: For greater speed, try batching your out-of-cache data accesses

When accessing hash tables, how much time is spent computing the hash functions?

Suppose that you create a large set of objects that you store in a hash table. Let us take 10 million objects. It is large enough that it probably will not fit in the cache memory of your processor. Let us keep the objects simple… say they are lists of three integers.

In Java, it looks like this…

int N = 10000000;
HashSet<List<Integer>> hm = new HashSet<List<Integer>>();
for(int k = 0; k < N; k++) {
     List<Integer> s = new ArrayList<Integer>(3);
     for(int z = 0; z < 3; z++) s.add(k * z - 2);
     hm.add(s);
}

Then you take another set made of a small number of values, say 100 objects. You count how many of these values are within the large set.

int count = 0;
for(List<Integer> st : small_hm) {
    if(hm.contains(st)) count++;
}

Because of the way a hash table works, each of your target objects is first hashed, which means that we apply some function that returns a random-like integer. Then we look up the value at an address determined by the hash value in the large hash table.
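In pseudo-Java, the lookup boils down to something like the following sketch; buckets and searchBucket are placeholders, and the real HashSet implementation is more involved (collision handling, resizing, extra bit mixing):

// Simplified sketch of a hash-table lookup: hash the key, map the hash
// value to a bucket, then search that bucket.
boolean contains(Object key) {
  int h = key.hashCode();                          // compute the hash value
  int bucket = (h & 0x7fffffff) % buckets.length;  // map it to a bucket
  return searchBucket(buckets[bucket], key);       // scan that bucket
}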

Java uses a really simple hash function:

int hashCode = 1;
for (Integer e : list)
    hashCode = 31*hashCode + e.hashCode();

Moreover, the hash value of each Integer object is the integer value itself. Thus hashing a list of integers in Java ought to be fast.

Meanwhile, accessing a table that is outside of the CPU cache is potentially expensive. But how expensive? Can we consider the computation of the hash values negligible?

I decided to run a small experiment.

Computing hash values (100)    Accessing the large table (10,000,000)
0.9 us                         0.7 us

Thus, for an out-of-cache hash table in this test, the majority of the time is due to the computation of the hash values! The result is even stronger when the table is reduced in size so that it fits in cache.

There are certainly cases where cache misses will pile up and make anything else look ridiculously expensive. However, you might be more computationally bound than you think.

As an aside, it means that precomputing the hash values of your objects, so that they do not need to be computed each time they are needed, can speed up your code noticeably.

My code is available.

Update: as pointed out by Evan in the comments, the HashSet class will then do extra bit-mixing work on top of what the Java hashCode method returns. It should be considered part of the hash-value computation. Unfortunately, it is not easy to estimate this cost so I have attributed it to the table-access cost. Therefore I underestimate the fraction of the time spent computing the hash value.

Follow-up: See Should you cache hash values even for trivial classes? and For greater speed, try batching your out-of-cache data accesses

When shuffling large arrays, how much time can be attributed to random number generation?

It is well known that contemporary computers don’t like to randomly access data in an unpredictable manner in memory. However, not all forms of random access are equally harmful.

To randomly shuffle an array, the textbook algorithm, often attributed to Knuth, is simple enough:

void swap(int[] arr, int i, int j) {
  int tmp = arr[i];
  arr[i] = arr[j];
  arr[j] = tmp;
}

void shuffle_java(int arr[]) {
  ThreadLocalRandom tlc = ThreadLocalRandom.current();
  int size = arr.length;
  for (int i = size; i > 1; i--) {
    swap(arr, i - 1, tlc.nextInt(i));
  }
}

Suppose that the array is large. Take an array made of 100 million elements. It far exceeds the CPU cache on the machines I own. Because a random shuffle tends to read data at random locations (by design), you expect many cache misses.

But what fraction of the running time can be attributed to the computation of the random indexes? To answer this question, we can precompute the random indexes and pass them to the shuffle function:

void shuffle_precomp(int arr[], int indexes[]) {
  int size = arr.length;
  for (int i = size; i > 1; i--) {
    swap(arr, i - 1, indexes[i - 1]);
  }
}

This saves computations. It also makes it easier for the processor to avoid expensive cache misses because it can easily (and correctly) predict where the next few reads will be, long before the data is needed.
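For completeness, the indexes can be precomputed ahead of time along these lines; this is a sketch matching the loop above, not necessarily the exact benchmark code:

// Sketch: generate one ranged random index per position so that the
// shuffle itself does no random number generation.
int[] precomputeIndexes(int size) {
  ThreadLocalRandom tlc = ThreadLocalRandom.current();
  int[] indexes = new int[size];
  for (int i = size; i > 1; i--) {
    indexes[i - 1] = tlc.nextInt(i); // random value in [0, i)
  }
  return indexes;
}

You would then call shuffle_precomp(arr, precomputeIndexes(arr.length)).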

To assess the difference, I designed a small benchmark. My results are clear:

Random shuffle    Precomputed shuffle
1.9 s             0.8 s

The bulk of the running time can be attributed to the generation of the random numbers and the accompanying latency involved.

I am not advocating that you precompute your ranged random numbers in this manner! It is not a good idea in practice. The sole purpose of my test is to show that a significant fraction of the running time of a shuffle function is due to the computation of the random indexes, even when working with large arrays.

For good measure, I also implemented a similar benchmark in C++ based on code from Arseny Kapoulkine. Arseny’s code uses O’Neill’s PCG instead of Java’s LCG, but that is a small difference. On the same machines, I get similar numbers: 1.6 s for the full shuffle and 0.8 s for the precomputed shuffle. So the issue is not specific to Java or to the random number generator it provides.

Java’s thread-local random number generation is a simple linear congruential generator. It is very fast. And I have done a fair amount of work comparing different random number generators. So you cannot make the problem go away by using a “faster” random number generator.

Further reading: You can make shuffle functions faster (I have a whole blog post on this), but the trick does not involve replacing the random number generator.