Roaring Bitmaps in JavaScript

Roaring bitmaps are a popular data structure to represent sets of integers. Given such sets, you can quickly compute unions, intersections, and so forth. It is a convenient tool for data processing.

I used to joke that Roaring bitmaps had been implemented in every language (Java, C, Rust, Go, and so forth), except JavaScript. This was a joke because I did not expect that one could implement Roaring bitmaps reasonably using JavaScript. I have written a few performance-oriented libraries in JavaScript (FastBitSet.js, FastPriorityQueue.js, FastIntegerCompression.js), but I could not imagine doing a whole Roaring bitmap implementation in JavaScript.

However, it turns out that many people who program in JavaScript run their code under Node.js. And Node.js supports “native addons”… which basically means that you can call a C/C++ library from JavaScript when you are working in Node.js, as long as someone packaged it for you.

And Salvatore Previti did just that for Roaring bitmaps. So you can, say, generate Roaring bitmaps in Java, then load them in Python, modify them, and then load them again in Node.js before shipping them to your Go program. Not that anyone is doing it, but it is possible, in theory.

Anyhow, how is the performance of Roaring bitmaps in Node.js? We can compare them with JavaScript’s Set data structure as well as against FastBitSet.js.

• suite intersection size
  262144 elements
Set                   33.26 ops/sec 
FastBitSet        14,364.56 ops/sec 
RoaringBitmap32  266,718.85 ops/sec 
  ➔ Fastest is RoaringBitmap32

• suite intersection (in place)
  65536 elements
Set                    199.99 ops/sec 
FastBitSet          93,394.64 ops/sec 
RoaringBitmap32  4,720,764.58 ops/sec 
  ➔ Fastest is RoaringBitmap32

• suite intersection (new)
  1048576 elements
Set                  3.32 ops/sec 
FastBitSet       1,436.14 ops/sec 
RoaringBitmap32  3,557.16 ops/sec
  ➔ Fastest is RoaringBitmap32

• suite union (in place)
  65536 elements
Set                  201.71 ops/sec 
FastBitSet       147,147.28 ops/sec 
RoaringBitmap32  497,687.77 ops/sec
  ➔ Fastest is RoaringBitmap32

• suite union size
  262144 elements
Set                   22.77 ops/sec
FastBitSet         7,766.65 ops/sec
RoaringBitmap32  274,167.71 ops/sec
  ➔ Fastest is RoaringBitmap32

• suite union (new)
  1048576 elements
Set                  1.72 ops/sec
FastBitSet         698.26 ops/sec
RoaringBitmap32  2,033.11 ops/sec 
  ➔ Fastest is RoaringBitmap32

So Roaring bitmaps can be thousands of times faster than a native JavaScript Set. They can be two orders of magnitude faster than FastBitSet.js. That Roaring bitmaps could beat FastBitSet.js is impressive: I wrote FastBitSet.js and it is fast!

Of course results will vary. No data structure is ever optimal in all use cases. But these numbers suggest that, in some cases, Roaring bitmaps will be useful in JavaScript.

What if you are not using Node.js? Maybe you are running your code in a browser? Salvatore Previti wrote a WebAssembly version as well, basically compiling the C code from the CRoaring library into WebAssembly. The wrapper is still incomplete and it is unclear whether WebAssembly is mature enough to give you good performance, but, one day soon, it might be possible to have fast Roaring bitmaps in the browser.

Science and Technology links (June 15th, 2018)

  1. The market for artificial-intelligence chips could reach $30bn by 2022. My guess is that NVIDIA (a graphics card maker) is the big winner; its tag line is now “Artificial Intelligence Computing Leadership”.
  2. Multiple sclerosis could be caused by polymicrobial infections.
  3. There are fungal infections in the central nervous systems of Alzheimer’s patients, but not in the rest of us.
  4. Google’s translate app can operate offline on a mobile phone using a 40 megabyte language model.
  5. People with severe mental problems are twice as likely to suffer from diabetes.
  6. The DeepMind folks have engineered a piece of software that can infer 3D topology from 2D pictures.

Emojis, Java and Strings

Emojis are funny characters that are becoming increasingly popular. However, they are probably not as simple as you might think when you are a programmer. As a basis of comparison, let me try to use them in Python 3. I define a string that includes emojis, and then I access the character at index 1 (the second character):

>>> x= "😂😍🎉👍"
>>> len(x)
4
>>> x[1]
'😍'
>>> x[2]
'🎉'

This works well. This fails with Python 2, however. So please upgrade to Python 3 (it came out ten years ago).

What about Java and JavaScript? They are similar but I will focus on Java. You can define the string just fine…

String emostring ="😂😍🎉👍";

However, that’s where troubles begin. If you try to find the length of the string (emostring.length()), Java will tell you that the string contains 8 characters. To get the proper length of the string in terms of “unicode code points”, you need to type something like emostring.codePointCount(0, emostring.length()) (this returns 4, as expected). Not only is this longer, but I also expect it to be much more computationally expensive.

What about accessing characters? You might think that emostring.charAt(1) would return the second character (😍), but it fails. The problem is that Java uses UTF-16 encoding which means, roughly, that Unicode characters can use either one or two 16-bit words, depending on the character. Thus, given a string of bytes, you cannot tell how long the string is without scanning it. Meanwhile, the character type in Java (char) is a 16-bit word, so it cannot represent all Unicode characters. You cannot represent an emoji, for example, using Java’s char. In Java, to get the second character, you need to do something awful like…

new StringBuilder().appendCodePoint(
  emostring.codePointAt(emostring.offsetByCodePoints(0, 1))).toString()

I am sure you can find something shorter, but that is the gist of it. And it is far more expensive than charAt.

If your application needs random access to unicode characters in a long string, you risk performance problems.

Other language implementations like PyPy use UTF-32 encoding which, unlike Java’s UTF-16 encoding, supports fast random access to individual characters. The downside is increased memory usage. In fact, it appears that PyPy wants to move to UTF-8, the dominant format on the Web right now. In UTF-8, characters are represented using 1, 2, 3 or 4 bytes.

There are different trade-offs. If memory is no object, and you expect to use emojis, and you want fast random access in long strings, I think that UTF-32 is superior. You can still get some good performance with a format like UTF-8, but you will probably want to have some form of indexing for long strings. That might be difficult to implement if your language does not give you direct access to the underlying implementation.
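To see why UTF-8 is awkward for random access, consider how you would locate the n-th code point: you have to scan from the beginning, skipping continuation bytes. Here is a small C sketch (the function name and interface are mine, purely for illustration):

#include <stddef.h>
#include <stdint.h>

// Return the byte offset of the n-th code point in a UTF-8 buffer.
// Continuation bytes have the form 10xxxxxx, so we skip over them.
// Without an auxiliary index, this is a linear scan.
size_t utf8_codepoint_offset(const uint8_t *s, size_t len, size_t n) {
  size_t offset = 0;
  while (offset < len && n > 0) {
    offset++; // move past the leading byte of the current code point
    while (offset < len && (s[offset] & 0xC0) == 0x80) {
      offset++; // skip continuation bytes
    }
    n--;
  }
  return offset;
}

With UTF-32, the same operation is just a multiplication by four.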

More annoying than performance is just the sheer inconvenience to the programmer. It is 2018. It is just unacceptable that accessing the second emoji in a string of emojis would require nearly undecipherable code.

Appendix: Yes. I am aware that different code points can be combined into one visible character so that I am simplifying matters by equating “character” and “code point”. I am also aware that different sequences of code points can generate the same character. But this only makes the current state of programming even more disastrous.

Science and Technology links (June 9th, 2018)

  1. A woman with late-stage breast cancer has been successfully cured using immunotherapy. She was preparing to die. She is now going to live hopefully many more years in good health.
  2. Deep learning generalizes poorly, in at least some cases:

    (…) when images are scaled to half their original size and randomly translated, the accuracy of deep learning is less than 50%. For comparison, the previous methods achieved over 70% accuracy, and this was considered poor performance.

  3. People with high IQ (a measure of intelligence) as well as people with fast reaction times are less likely to die.
  4. Looking young is a sign of good health: Perceived age—which is widely used by clinicians as a general indication of a patient’s health—is a robust biomarker of ageing that predicts survival.
  5. While treating various neurodegenerative disorders with stem cells, we accidentally observed a transition of hair colorations from mostly grey to black in 4 elderly patients.
  6. Ibuprofen (a common analgesic) depresses testicular function, including testosterone production.
  7. Analogue non-volatile memory can accelerate the neural-network training by two orders of magnitude.
  8. Some antibodies block inflammation, protecting mice from hardened arteries and liver disease:

    The antibody also prolonged the life of the mice. After 15 months, all of the antibody-producing mice were alive, compared to 54 percent of the control mice.

  9. If robots are stealing our jobs, they are doing a poor job at it. There are more job openings in the US than there are people available. I’m no economist, but I’d expect to see rising wages.
  10. Asians eat less rice as they get richer.

Vectorizing random number generators for greater speed: PCG and xorshift128+ (AVX-512 edition)

Most people who design random number generators write them as regular code. If they are aiming for speed, they probably write functions in C. However, our processors have fast “vectorized” (or SIMD) instructions that allow you to go faster. These instructions do several operations at once. For example, recent Skylake-X processors from Intel can do eight 64-bit multiplications with a single instruction.
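For instance, here is a small C sketch of what such an instruction does, written with Intel intrinsics; it assumes a processor with AVX-512DQ support and a compiler flag such as -mavx512dq (the function name is mine, purely for illustration):

#include <immintrin.h>
#include <stdint.h>

// multiply eight pairs of 64-bit integers with a single vector instruction
void multiply8(const uint64_t *a, const uint64_t *b, uint64_t *out) {
  __m512i va = _mm512_loadu_si512((const void *)a);
  __m512i vb = _mm512_loadu_si512((const void *)b);
  __m512i product = _mm512_mullo_epi64(va, vb); // eight products at once
  _mm512_storeu_si512((void *)out, product);
}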

There is a vectorized version of the Mersenne Twister generator used in some C++ standard libraries, but the Mersenne Twister is not particularly fast to begin with.

What happens if we vectorize really fast generators?

I had good luck vectorizing Vigna’s xorshift128+ random number generator. A generator like it is part of some JavaScript engines. The xorshift128+ generator produces 64-bit values, but you can consider them as two 32-bit values. On my Skylake-X processor, I can generate 32-bit random integers at a rate of 2 cycles per integer using xorshift128+. That’s almost twice as fast as when using the default, scalar implementation in C.

scalar xorshift128+   3.6 cycles per 32-bit word
SIMD xorshift128+     1.0 cycles per 32-bit word
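For reference, the scalar xorshift128+ kernel that is being vectorized is tiny. A minimal C version, following Vigna’s reference code, looks like this:

#include <stdint.h>

// generator state; must be seeded to something other than all zeros
static uint64_t s[2];

uint64_t xorshift128plus(void) {
  uint64_t s1 = s[0];
  const uint64_t s0 = s[1];
  const uint64_t result = s0 + s1;
  s[0] = s0;
  s1 ^= s1 << 23;
  s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
  return result; // 64 random bits, usable as two 32-bit values
}

The vectorized version essentially applies the same shifts and xors to several independent states at once in wide registers.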

PCG is a family of random number generators invented by O’Neill. Can we do the same with PCG? I took a first stab at it with my simdpcg library. My vectorized PCG ends up using about 1 cycle per integer, so it has the same speed as the vectorized xorshift128+.

scalar PCG   4.3 cycles per 32-bit word
SIMD PCG     1.0 cycles per 32-bit word
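Roughly speaking, the scalar baseline here is O’Neill’s 32-bit PCG generator: a 64-bit linear congruential step followed by an output permutation. A minimal C version, for reference:

#include <stdint.h>

typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;

uint32_t pcg32_random_r(pcg32_random_t *rng) {
  uint64_t oldstate = rng->state;
  // linear congruential step advances the state
  rng->state = oldstate * 6364136223846793005ULL + (rng->inc | 1);
  // output permutation: xorshift the high bits, then apply a random rotation
  uint32_t xorshifted = (uint32_t)(((oldstate >> 18u) ^ oldstate) >> 27u);
  uint32_t rot = (uint32_t)(oldstate >> 59u);
  return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

The vectorized versions in simdpcg run several such generators side by side in SIMD lanes, which is where the interleaving mentioned below comes in.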

In these tests, I simply write out the random numbers to a small array that fits in cache. I only measure raw throughput. To get these good results, I “cheat” a bit by interleaving several generators in the vectorized versions. Indeed, without this interleave trick, I find that the processor is almost running idle due to data dependencies.

My C code is available.

Sadly, I expect that most of my readers do not yet have processors with support for AVX-512 instructions. They are available as part of the Skylake-X and Cannonlake processors. Intel has not been doing a great job at selling these new processors in great quantities. You may be able to have access to such processors using the cloud.

Update: In my initial version, I reported worse performance on xorshift128+. Samuel Neves pointed out to me that this is due to the lack of inlining, because I compile the xorshift128+ functions in their own object files. We can solve this particular problem using link time optimization (LTO), effectively by passing the -flto flag as part of the compile command line. As usual, results will vary depending on your compiler and processor.

Science and Technology links (June 2nd, 2018)

  1. Human hearts do not regenerate. Cardiovascular diseases are the leading cause of death in the Western world. Japanese doctors will graft sheets of tissue derived from reprogrammed stem cells in the hope of healing diseased human hearts. It is a small and uncertain trial.
  2. The surface of your eyes (cornea) may become damaged following an accident or a disease. You can receive a transplant, but it is difficult. On an experimental basis, we are now printing new corneas.
  3. Viagra is out of patent. You can get it cheaper. Viagra might be good for your heart. It also raises testosterone which might help you get more muscular. (I don’t take viagra, but maybe I should.)
  4. Is it a good investment to give a lot of money to a great researcher? Not really: beyond a certain point, you are wasting your money. The scientific return on research reaches a maximum at around $400,000 of annual support per researcher. Funders should limit the maximum amount of funding per researcher. Maybe you think that $400,000 is a lot, but it is not difficult for some academics to accumulate a lot more funding due to the Matthew effect.
  5. Yellow spots in the eyes could be a symptom of dementia.
  6. Maybe unsurprisingly, you can buy your way into Harvard:

    Before he married Ivanka Trump and became a top adviser to the President, Jared Kushner was well-known for the rumors surrounding the circumstances of his admission to Harvard. Jared’s father, Charles, a New Jersey real-estate developer, donated $2.5 million to the college shortly before Jared applied. Officials at Jared’s high school told Golden that they were shocked he got in (…) And they were especially disappointed that he was admitted while his high-school classmates who were better-qualified on the merits got rejected.

    The whole point of places like Harvard is that it is hard to get in. But it is apparently much easier if you are rich and connected.

    Logically, this means that a Harvard degree should have much more weight if you are from a modest background.

  7. Donald Trump wears expensive clothes, he is rich, he graduated from a top-5 school, and he is not modest. It works for him.

    Why are some people modest? Shouldn’t you always broadcast how great you are? But advertising or burying your signals is itself a signal (a meta-signal). So if you went to Harvard, walking around with a Harvard hat and Harvard shirt is maybe less wise than it sounds.

    The really cool and powerful people don’t need to flaunt their accomplishments. If you do much flaunting, you are signaling that you are struggling with your social status.

    And some apparently negative signals can be positive ones, as they can be used to display confidence. For example, modest clothes can be a signal of power in some contexts.

    I suspect that your signals determine who you get to work with. So they are both tremendously important but also quite complicated.

  8. Reviewers are much more likely to recommend acceptance of papers from top institutions. That is, given the exact same paper, if you are told that it is from a top school, you are much more likely to recommend it, even if everything else is kept the same.

    There is an interesting consequence to this observation. When you are given a list of accepted papers at a selective event, look for the contributions from people at non-elite institutions. Their work is probably much more solid since there is a bias working against them and they are the survivors of a more difficult game.

  9. Following the second world war, copyright was temporarily waived on German books. This led them to become much more cited. It should be obvious that copyright prevents access. At a civilization scale, this has a significant cost. For example, medical doctors can’t easily access the latest research.
  10. Conclusive evidence for the benefit of any supplement has not been demonstrated. Please don’t take multivitamins: at best they are a waste of time and money; at worst, they are harmful.
  11. Google is working for the military. Many of its engineers are unhappy. I don’t understand why Google does this. Why risk annoying its best engineers, possibly getting them to seek offers elsewhere? Yes, I know that it must be lucrative, but Google has enormous cash reserves: billions and billions that it does not know how to spend. Google’s most valuable asset is not its money. So I am guessing that Google is after more than money.
  12. Only half of kids ages 13-17 use Facebook. Snapchat, Instagram, and YouTube are more popular. Instagram is owned by Facebook.

Greater speed in memory-bound graph algorithms with just straight C code

Graph algorithms are often memory bound. When you visit a node, there is no reason to believe that its neighbours are located nearby in memory.

In an earlier post, I showed how we could accelerate memory-bound graph algorithms by using software prefetches. We were able to trim a third of the running time just by adding a line of code. But I neither like nor recommend software prefetching.

In my test case, I am trying to find the distance between two nodes by doing a breadth-first search. I use a random graph containing 10 million nodes and where each node has degree 16 (meaning that each node has 16 neighbours). I start from one node, visit all its neighbours, then visit the neighbours of the neighbours, until I arrive at my destination. The pseudocode is as follows:

insert my starting node in a queue
for each node x in my queue {
  for each node y connected to x {
    // option: prefetch the node y' that follows y
    if (y is my target) { 
      // success
    } 
    if (y is new) {
       add y to next queue
    } 
  }
}

I can rewrite this code so that I visit the nodes in chunks. I grab 8 nodes, look at the first neighbour of each of the 8 nodes, then the second neighbour of each of the 8 nodes, and so forth. We interleave the neighbour accesses. The pseudocode looks as follows:

insert my starting node in a queue
// divide the queue into chunks of up to 8 nodes 
for each chunk of nodes x1, x2, ..., x8 in my queue {
  for index i in 1, 2..., degree {
     for x in x1, x2, ..., x8 {
       let y be neighbour i of x
       if (y is my target) { 
         // success
       } 
       if (y is new) {
         add y to next queue
       } 
     }
  }
}
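Here is a rough C sketch of one level of this interleaved traversal. It assumes a flat adjacency array where every node has exactly degree neighbours (as in my random-graph benchmark); the names are illustrative and not taken from my actual code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Process one level of the breadth-first search, 8 queued nodes at a time.
// The neighbours of node x are neighbors[x * degree + 0 .. degree - 1].
void process_level(const uint32_t *neighbors, uint32_t degree,
                   const uint32_t *queue, size_t queue_size,
                   bool *visited, uint32_t *nextqueue, size_t *nextsize,
                   uint32_t target, bool *found) {
  const size_t chunk = 8;
  for (size_t start = 0; start < queue_size; start += chunk) {
    size_t end = (start + chunk < queue_size) ? start + chunk : queue_size;
    // interleave: neighbour i of every node in the chunk, then neighbour i + 1
    for (uint32_t i = 0; i < degree; i++) {
      for (size_t j = start; j < end; j++) {
        uint32_t y = neighbors[(size_t)queue[j] * degree + i];
        if (y == target) {
          *found = true;
        }
        if (!visited[y]) {
          visited[y] = true;
          nextqueue[(*nextsize)++] = y;
        }
      }
    }
  }
}

Because eight different neighbour lists are being read at any given time, the processor has several independent memory requests in flight instead of waiting on one.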

From a purely computer science perspective, this new pseudocode should not be faster. You will not find it in any textbook I know. Yet it is considerably faster when working over large graphs.

How fast is this new approach? About twice as fast as the naive approach, and faster than the one relying on software prefetches. My source code is available.

naive breadth-first search         23 cycles per edge visited
prefetched breadth-first search    16 cycles per edge visited
new approach                       12 cycles per edge visited

I like this example because it illustrates quite well that even if a problem is memory-bound, you can still code a faster solution. There is no need for specialized instructions. There is no need to move anything around in memory. How you access the data matters.

I should point out that this is just something I threw together during a lazy Monday. It is probably nowhere near optimal. It is totally lacking in sophistication. Yet it works. Moreover, it is going to be effective on almost all modern computer architectures.

Science and Technology links (May 26th, 2018)

  1. Teicholz argues that Nutrition Science is Not Up to the Task:

    Despite methodological advances, nutritional epidemiology remains fundamentally limited by its observational nature. Guidelines relying on this circumstantial evidence can be little more than educated guesses. Enacting prevention guidelines based on lesser evidence risks repeating what might be a tragic history of doing more harm than good.

  2. Contrary to earlier research, a new article finds that teaching kids to delay gratification is likely a waste of time. The whole concept of “emotional intelligence” is probably much less solid than was believed.
  3. As more Chinese have left farms in the countryside to work in factory cities, the suicide rate has plummeted.
  4. Excess inflammation is a key ingredient in many nasty diseases. A recent paper suggests that eating baking soda could be anti-inflammatory. Drinking water filled with baking soda is not pleasant.
  5. Many people take statins, a wide family of drugs that are supposed to reduce your risk of having a cardiovascular incident. It seems that part of their beneficial effect could be that they neutralize your stem cells… which helps reduce the production of undesirable cells, but also leads to what can be described as accelerated aging…

    statins impaired the osteogenic and chondrogenic differentiation potential of stem cells and increased cell senescence and apoptosis (…) Statins also impaired the expression of DNA repair genes

  6. We still don’t know how life arose on Earth. Steele et al. argue that life came from space:

    Even if we concede that the dominant neo-Darwinian paradigm of natural selection can explain aspects of the evolutionary history of life once life gets started, independent abiogenesis on the cosmologically diminutive scale of oceans, lakes or hydrothermal vents remains a hypothesis with no empirical support and is moreover unnecessary and redundant. With astronomical data now pointing to the existence of hundreds of billions of habitable planets in our galaxy alone, such a hypothesis seeking an independent origin of life on any single planet seems to be no longer hardly necessary.

    Their theory is that Earth is bombarded by genetic material, viruses and bacteria… from space. They appear to have interesting evidence:

    Grebennikova et al (2018) have now confirmed the discovery of several microbial species associated with cosmic dust on the exterior windows of the International Space Station (ISS), and contamination at source and in the laboratory has been ruled out. The results of PCR amplification followed by DNA sequencing and phylogenetic analysis have established the presence of bacteria of the genus Mycobacteria and the extremophile genus Delftia, amongst others, associated with deposits of cosmic dust, which are now from a height of some 400km above the Earth’s surface. A terrestrial origin seems most unlikely. Studies by Wickramasinghe and Rycroft (2018) have shown that all possible mechanisms for lofting these organisms against gravity to heights of 400km in the ionosphere fall short by many orders of magnitude.

  7. When part of your brain dies, it is believed that it does not get repaired. Instead, your brain adapts to the loss. Researchers have found some evidence that the brain could be tricked into repairing itself. They have injected a degradable gel that provides support for cells to come and grow. In mice, this led to what seems to be brain repair.
  8. Contrary to expectations, high blood sugar appears to be more of a concern for athletes than low blood sugar, even in those with the highest energy expenditure and consuming below the recommended carbohydrate intake.
  9. Males were more likely to take shortcuts and reached their goal location faster than females, while females were more likely to follow learned routes and wander. (…) This research indicates that the sex difference in navigation efficiency is large.
  10. Richard Cocks on research:

    It is the very nature of research that it is not possible to list Goals and Objectives in advance. Or rather, one might have Goals and Objectives but they must be provisional and change as research progresses. If the outcome of research were known in advance no research would be necessary. By definition, what one will discover is a mystery. (…) Since temporary enthusiasms are not predictable or possible to artificially generate, nor to know in advance what one will find, it is not possible to know what direction research will go in. There is simply no point in forcing someone to think about a problem that he has no interest in; not if the goal is to be creative.

    (Found via John D Cook.)

  11. Sarah Constantin writes on cancer research:

    The whole cycle, from no chemotherapies at all to development, trial, and FDA approval for multiple chemotherapy drugs, took just six years. Modern developments, by contrast, can take decades to get to market. The type of research that gave us chemotherapy could never receive funding—and would likely get its practitioners thrown in jail—if it were attempted today.

    The problem is clear: Despite tens of billions of dollars every year spent on research, progress in combating cancer has slowed to a snail’s pace. So how can we start to reverse this frustrating trend? We know from history that cancer research doesn’t need to cost billions to be effective. Instead of open-ended grants, donors could pay for results via contracts or prizes.

    We used to think cancer was conquerable. Today, that idea is often laughed off as utopian. But there are countless reasons to believe that progress has slowed because of organizational and governmental problems, not because the disease is inherently incurable.

  12. CO2 makes plants grow faster and larger, but it does not necessarily make them better for human consumption.
  13. Google worked with LG to develop a display panel providing an astonishing 18.1 megapixels of detail per eye in a virtual-reality headset. That’s an insane number of pixels, and we do not have a headset able to make use of these displays. I suppose that the idea is to add eye tracking so that we can offer high-resolution details where the user is looking. My current take on virtual reality is that we are not primarily limited by the quality of the hardware, but rather by the lack of great software. Still, it is nice to know that better hardware is on the horizon.

Gender and peer review

Modern science works in the following manner. You do the research. You write a paper. You publish the paper. For historical reasons, “publishing the paper” typically means “submit it to a committee of your peers and get their seal of approval”.

We can rightfully be concerned about the unwanted biases that this committee of peers might have. For example, maybe these peers have unconscious biases against women?

It is not unreasonable. We have evidence that peer review is strongly biased in favour of authors from prestigious institutions. For example, Peters and Ceci took already accepted articles from authors at prestigious institutions, and they resubmitted them as if they came from authors at lesser institutions: they encountered an 89% rejection rate.

Okike et al. found that

reviewers were more likely to recommend acceptance when the prestigious authors’ names and institutions were visible (single-blind review) than when they were redacted (double-blind review) (87% vs 68%) and also gave higher ratings for the methods and other categories.

That, by itself, is not as worrying as it seems. It is a fact that Stanford researchers are better, on average, than researchers from that school you never heard of. Everything else being equal, it makes sense to put more trust in work from Stanford.

But gender is something else. We have no rational reason to trust the work done by men more than the work done by women.

So is peer review sexist? Webb et al. in Does double-blind review benefit female authors? write:

We found a significant interaction between gender and time (P < 0.0001), reflecting the higher female authorship post-2001 than pre-2001, but there was no significant interaction between gender and review type (...) Budden and colleagues call for the ecological and evolutionary community to revisit the issue of peer review in light of their study of BE, and their claim that double-blind review benefits female authors (...) This is despite the fact that the only evidence they supply is an increase in female first authorship in a single journal (an increase also seen in other journals that did not switch to double-blind review) rather than anything more compelling, such as a demonstration that the ratio of accepted to submitted manuscripts had increased for females relative to males after the introduction of double-blind review.

Ceci and Williams review the evidence regarding biases against women:

The preponderance of evidence, including the best and largest studies, indicates no discrimination in reviewing women’s manuscripts

Sugimoto et al. find that

(…) recent meta-analysis suggests that claims of gender bias in peer review “are no longer valid”. For example, if there is gender bias in review, we would expect double-blind conditions to increase acceptance rates for female authors. However, this is not the case. Nor are manuscripts by female authors disproportionately rejected at single-blind review journals. Even when the quality of submissions is controlled for, manuscripts authored by women do not appear to be rejected at a higher rate than those authored by men. Meta-analyses and large-scale studies of grant outcomes found no gender differences after adjusting for factors such as discipline, country, institution, experience, and past research output.

Tomkins et al. found that

the influence of author gender on bidding or reviewing behavior is not statistically significant.

The result extends to grant reviews:

We found no evidence that White male principal investigators received evaluations that were any better than those of principal investigators from the other social categories, and this conclusion was robust to a wide array of model specifications. Supplemental analyses suggest that any bias that is present is likely below the threshold of pragmatic importance.

Li and Agha found in Big names or big ideas that grant reviews are fair:

We find that better peer-review scores are consistently associated with better research outcomes and that this relationship persists even when we include detailed controls for an investigator’s publication history, grant history, institutional affiliations, career stage, and degree types

It even seems that women are slightly better off:

Our results indicate that female principal investigators (PIs) receive a bonus of 10% on scores, in relation to their male colleagues.

So it seems that we have good news. Peer review is not, statistically speaking, sexist.

Graph algorithms and software prefetching

A lot of data in the real world can be represented as graphs: you have nodes connected through edges. For example, you are a node in a graph where friendships are edges.

I recently met with professor Semih Salihoglu, an expert in graph databases and algorithms. We discussed fun problems like how one can find the shortest path between two nodes in a very large graph.

Semih argued something to the effect that, often, the best you can do is a breadth-first search. That sounds scary and fancy… but it is actually super simple. Start from a given node. This node is at level 0. Then visit all its neighbors (by iterating through its edges). These nodes are at level 1. Then visit all the nodes connected to the nodes at level 1 (excluding your starting node), these are at level 2. And so forth. The magic is that you will end up visiting once (and exactly once) all nodes that can be reached from your starting node.

With this approach, you can find the distance between any two nodes. Just keep exploring, starting from one of the two nodes, until you encounter the other node.
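As a concrete illustration, here is a C sketch of that distance computation. It assumes a flat adjacency array where every node has exactly degree neighbours; the names and layout are illustrative, not the actual benchmark code:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

// Distance (number of edges) from source to target, or -1 if unreachable.
int64_t bfs_distance(const uint32_t *neighbors, uint32_t degree,
                     uint32_t nodecount, uint32_t source, uint32_t target) {
  if (source == target) return 0;
  bool *visited = calloc(nodecount, sizeof(bool));
  uint32_t *queue = malloc(nodecount * sizeof(uint32_t));
  uint32_t *next = malloc(nodecount * sizeof(uint32_t));
  size_t queuesize = 1, nextsize = 0;
  queue[0] = source;
  visited[source] = true;
  int64_t level = 0, answer = -1;
  while (queuesize > 0 && answer < 0) {
    level++; // nodes currently in 'queue' are at distance level - 1
    for (size_t j = 0; j < queuesize && answer < 0; j++) {
      for (uint32_t i = 0; i < degree; i++) {
        uint32_t y = neighbors[(size_t)queue[j] * degree + i];
        if (y == target) { answer = level; break; }
        if (!visited[y]) { visited[y] = true; next[nextsize++] = y; }
      }
    }
    uint32_t *tmp = queue; queue = next; next = tmp; // next level becomes current
    queuesize = nextsize;
    nextsize = 0;
  }
  free(visited);
  free(queue);
  free(next);
  return answer;
}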

But what happens if the graph is very large? These nodes you are visiting are all over your memory. This means that each node access is a potential cache miss. Our processors are super fast, and they have super fast memory close to them (cache memory), but your main memory (RAM) is comparatively much slower.

Thus, when processing a large graph, you are likely memory-bound… meaning that much of the time is spent waiting for memory access. It is worse than it appears because memory access is a shared resource in multicore processors, which means that you cannot make this problem go away cheaply by buying a processor with many more cores.

Can we quantify this?

I built a large random graph made of 10 million nodes where each node has 16 random neighbors.

I pick a node at random, seek another node far away, and then I measure the time it takes to do the breadth-first search from one to the other. On a per-edge basis, it takes 23 cycles. Don’t worry, things get much worse if the graph gets larger, but let us reflect on the fact that 23 cycles to merely look at the node identifier, check if it has been visited and if not, add it to our list… is a lot. Not counting memory accesses, we should be able to do this work in 5 cycles or less.

Can we do better than 23 cycles?

What if, right before you start processing the neighbors of one node, you told your processor to go fetch the neighbors of the next node? I have a recent post on this very topic: Is software prefetching (__builtin_prefetch) useful for performance?

In that post, I explained that Intel processors have prefetching instructions that the software can call. I also recommended avoiding them.

So what happens if I add a prefetch instruction to my graph code? I go down to 16 cycles… saving a third of the running time.
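Concretely, the change amounts to something like the following sketch: while we process the neighbours of one queued node, we ask the processor to start loading the neighbour list of the next queued node. (Names and layout are illustrative; __builtin_prefetch is the GCC/Clang intrinsic.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Scan one level of the search, prefetching the next node's neighbour list.
// The neighbours of node x are neighbors[x * degree + 0 .. degree - 1].
bool level_contains_target(const uint32_t *neighbors, uint32_t degree,
                           const uint32_t *queue, size_t queue_size,
                           uint32_t target) {
  bool found = false;
  for (size_t j = 0; j < queue_size; j++) {
    if (j + 1 < queue_size) {
      // ask the processor to start loading the next node's neighbours
      __builtin_prefetch(&neighbors[(size_t)queue[j + 1] * degree]);
    }
    for (uint32_t i = 0; i < degree; i++) {
      if (neighbors[(size_t)queue[j] * degree + i] == target) {
        found = true;
      }
    }
  }
  return found;
}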

naive breadth-first search         23 cycles per edge visited
prefetched breadth-first search    16 cycles per edge visited

My code is available (in C).

My result would seem to invalidate my recommendation to avoid software prefetches. But keep in mind that my implementation is naive and limited, thrown together in a couple of hours. It is a proof of concept. What it demonstrates is that even if you are limited by memory accesses, there are still software choices you can make to help you.

I would only change my recommendation against software prefetches if we definitively could not rewrite the code differently to get the same benefits. I think we can write more clever code.

There are many problems with software prefetches. In some cases, as is the case here, it is better than nothing… But it is still a fragile hack. It helps in my particular case, but change the parameters of the graph, and things might go to hell. Update your processor and things could go to hell. And there is no way to know whether the exact way I did it is close to optimal… it works well in my case, but it might require much tuning in other instances.

So how can we write the code better? I am not certain yet.

Follow-up: Greater speed in memory-bound graph algorithms with just straight C code