# Greater speed in memory-bound graph algorithms with just straight C code

Graph algorithms are often memory bound. When you visit a node, there is no reason to believe that its neighbours are located nearby in memory.

In an earlier post, I showed how we could accelerate memory-bound graph algorithms by using software prefetches. We were able to trim a third of the running time just by adding a line of code. But I do not like nor recommend software prefetching.

In my test case, I am trying to find the distance between two nodes by doing a breadth-first search. I use a random graph containing 10 million nodes and where each node has degree 16 (meaning that each node has 16 neighbours). I start from one node, visit all its neighbours, then visit the neighbours of the neighbours, until I arrive at my destination. The pseudocode is as follows:

```insert my starting node in a queue
for each node x in my queue {
for each node y connected to x {
// option: prefetch the node y' that follows y
if (y is my target) {
// success
}
if (y is new) {
}
}
}
```

I can rewrite this code where I visit the nodes in chunks. So I grab 8 nodes, I look at the first neighbour of each of the 8 nodes, then I look at the second neighbour of the 8 nodes and so forth. We interleave neighbours. The pseudocode looks as follows:

```insert my starting node in a queue
// divide the queue into chunks of up to 8 nodes
for each chunk of nodes x1, x2, ..., x8 in my queue {
for index i in 1, 2..., degree {
for x in x1, x2, ..., x8 {
let y be neighbour i of x
if (y is my target) {
// success
}
if (y is new) {
}
}
}
}
```

From a purely computer science perspective, this new pseudocode should not be faster. You will not find it in any textbook I know. Yet it is considerably faster when working over large graphs.

How fast is this new approach? About twice as fast as the naive approach, and faster than the one relying on software prefetches. My source code is available.

 naive breadth-first search 23 cycles per edge visited prefetched breadth-first search 16 cycles per edge visited new approach 12 cycles per edge visited

I like this example because it illustrates quite well that even if a problem is memory-bound, you can still code a faster solution. There is no need for specialized instructions. There is no need to move anything around in memory. How you access the data matters.

I should point out that this is just something I threw together during a lazy Monday. It is probably nowhere near optimal. It is totally lacking in sophistication. Yet it works. Moreover, it is going to be effective on almost all modern computer architectures.

### Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

## 17 thoughts on “Greater speed in memory-bound graph algorithms with just straight C code”

1. Semih Salihoglu says:

This is getting more interesting. Which optimization is kicking in here? Is it the processor sending multiple requests to the memory before doing any computation and that capturing some kind of locality?

Also should this also work in Java? I wrote this Java code quickly that does a 7 step BFS traversal starting from a random source doing it 1-by-1, or by your 8-by-8 trick. The 8-by-8 trick doesn’t seem to help on my Mac OS. I’m using Java 8. I didn’t check your code in detail, I just wrote it based on the pseudocode.

1. Which optimization is kicking in here? Is it the processor sending multiple requests to the memory before doing any computation and that capturing some kind of locality?

Exactly. The idea is that you can replace a software prefetch by a data access that is “ahead of time”. This data access will get stuck, but the processor will work for you, trying to fill it in.

I should warn you that the code above is not optimal. It is a hack to prove the concept.

Also should this also work in Java?

Yes.

I chose C because it will allow me to do things that are not possible in Java. (I am not nearly done!) But what I just did is language independent.

I’ll look at your Java code. (Next week?)

1. Frank Tetzel says:

Your data access is not really “ahead-of-time”, but out-of-order execution kicks in as the accesses in the batch are independent.

There is prior work doing very similar things with prefetching as out-of-order execution has its limits (window size). But I’m not aware of anybody doing it for graph traversals.
All of them utilize memory parallelism to reduce or hide memory latency such that the cache miss penalty is less severe.

group prefetching
https://dl.acm.org/citation.cfm?id=1272747
Asynchronous memory access chaining
http://www.vldb.org/pvldb/vol9/p252-kocberber.pdf
Interleaving with Coroutines
http://www.vldb.org/pvldb/vol11/p230-psaropoulos.pdf

1. Travis Downs says:

I would be surprised if techniques like this to include MLP on graph algorithms aren’t widely used in the industry. I have looked at very few graph traversal algorithms, but one I did look at a long time ago: the mark part of the JDK’s mark & sweep garbage collector definitely used a strategy to ensure multiple requests were in flight at once: the nodes to visit were kept in a queue, and prefetch requests were issued to the N elements at the head of the queue, to try to keep about N requests in flight at once.

I remember reading it way back then and thinking “that’s clever!”. If you were really sure you were almost always missing to DRAM, you might even apply other tricks such as sorting the nodes in the queue so that they have a more linear access pattern, or visiting them in an order that was otherwise cache or memory controller friendly.

1. It depends what you mean by “the industry”. There is related published work in the field of garbage collection, and I expect state-of-the-art garbage collectors to optimize accordingly.

1. Travis Downs says:

By “industry” I just mean the collection of people writing software for use, whether commercial, open source, internal use, whatever. More or less as opposed to “in the literature” and/or academia.

Above there was some discussion of whether this idea had been explored, and a comment that its applicability to graphs perhaps hadn’t been explored. My claim then is that this has almost certainly been repeatedly explored “in the industry” (i.e., software about which no paper is written, or at least no paper exploring this issue) since one of the few graph algorithms I looked at long ago already used the trick of explicitly managing a queue of future memory accesses and prefetches to get MLP.

Perhaps that “trick” originated in a GC paper, but in my experience that is not the normal direction of flow for such O(1) (i.e., not changing the complexity) tricks.

2. I’m not surprised. Algorithms textbooks assume that the difference between a machine with 220 instructions in flight at a time and 1 instruction in flight is O(1), which of course it is. ðŸ™‚ However, there is a huge speedup that can be had by these sort of transformations if you can break dependencies. I’m working on a Random Forest implementation off-and-on and all the first-order issues I’m having with it are identical to what you discuss, although a binary tree is way easier to deal with than an arbitrary graph and more amenable to optimizations. It is interesting that there is way more work on naive parallelism of this sort of thing (“let’s chuck 8 cores at it”) than there is extracting the parallelism latent in one core. Of course, if you get too good at extracting per-core parallelism you will wind up potentially saturating uncore resources like L3 or the memory bus. But I have plenty of algorithms whose naive versions really do nothing but sit around and wait for L2 all day, and max-ing out L2 utilization is just fine as far as core scaling goes.

3. Scott Hess says:

The nice thing about an explicit prefetch version is that I, the reader, look at it and immediately know “This is designed to target a specific memory hierarchy”. The groups-of-8 version certainly can have a comment which describes why it is that way, but in my experience the more likely thing you’ll find is a piece of code which feels like it’s more complicated than it needs to be, without any explanation of how or why.

1. I don’t disagree but the approach I have taken here is probably both suboptimal and overly complicated. I think you can get both better speed and simpler code with some effort.

4. I’m thinking prefetch and gather are both potentially interesting optimization areas to look at, even though I have never had a particularly fun time with gather. Both address the problem of running out of places to stash data items fetched in a wide pipeline (in radically different ways).

5. wmu says:

I use a random graph containing 10 million nodes and where each node has degree 16 (meaning that each node has 16 neighbours).

I’d love to see results with less regular data, when the degree varies.

1. I’d like to move this to real data, but there is more at the implementation side of things that could be done.

6. Keith Burns says:

VPP has been doing this for 15 years. Well worth a look for some brilliant performant patterns

Wiki.fd.io

1. Well worth a look for some brilliant performant patterns

Can you point to some of these patterns?

7. Marwan Burelle says:

Interesting!

Some years ago, we’ve done tests that somehow may be in line with this.

We were trying to speed-up diameter algorithms (which involve a lot of breadth first traversal) and find out that just renumbering the graph with the order of a bfs can have huge impact. Our guess was that it increases locality. Unfortunately we had other projects more important …

8. cwl says:

Will this work when using openmp ?

You may subscribe to this blog by email.