Graph algorithms are often memory bound. When you visit a node, there is no reason to believe that its neighbours are located nearby in memory.

In an earlier post, I showed how we could accelerate memory-bound graph algorithms by using software prefetches. We were able to trim a third of the running time just by adding a line of code. But I neither like nor recommend software prefetching.

In my test case, I am trying to find the distance between two nodes by doing a breadth-first search. I use a random graph containing 10 million nodes where each node has degree 16 (meaning that each node has 16 neighbours). I start from one node, visit all its neighbours, then visit the neighbours of the neighbours, until I arrive at my destination. The pseudocode is as follows:

insert my starting node in a queue
for each node x in my queue {
    for each node y connected to x {
        // option: prefetch the node y' that follows y
        if (y is my target) { // success }
        if (y is new) { add y to next queue }
    }
}
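As a concrete sketch, the naive traversal might look as follows in C. The CSR-style graph layout and all names here are my own illustrative assumptions, not the post's actual code: the neighbours of node i are stored in edges[offsets[i] .. offsets[i+1]-1].

```c
#include <stdint.h>
#include <stdlib.h>

/* CSR-style graph: the neighbours of node i are
   edges[offsets[i] .. offsets[i+1]-1]. */
typedef struct {
    uint32_t n;        /* number of nodes        */
    uint32_t *offsets; /* n + 1 entries          */
    uint32_t *edges;   /* offsets[n] entries     */
} graph;

/* Returns the BFS distance from source to target, or -1 if unreachable. */
int bfs_distance(const graph *g, uint32_t source, uint32_t target) {
    if (source == target) return 0;
    uint8_t *seen = calloc(g->n, 1);
    uint32_t *queue = malloc(g->n * sizeof(uint32_t));
    uint32_t *next = malloc(g->n * sizeof(uint32_t));
    size_t qlen = 1, nlen;
    int dist = -1;
    queue[0] = source;
    seen[source] = 1;
    for (int level = 1; qlen > 0 && dist < 0; level++) {
        nlen = 0;
        for (size_t k = 0; k < qlen && dist < 0; k++) {
            uint32_t x = queue[k];
            for (uint32_t j = g->offsets[x]; j < g->offsets[x + 1]; j++) {
                /* option: __builtin_prefetch(&g->edges[j + 1]), or
                   prefetch the visited bit of a neighbour further ahead */
                uint32_t y = g->edges[j];
                if (y == target) { dist = level; break; }
                if (!seen[y]) { seen[y] = 1; next[nlen++] = y; }
            }
        }
        uint32_t *tmp = queue; queue = next; next = tmp;
        qlen = nlen;
    }
    free(seen); free(queue); free(next);
    return dist;
}
```

The two-queue structure (current level, next level) matches the pseudocode: a level counter stands in for per-node distances.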

I can rewrite this code so that I visit the nodes in chunks. So I grab 8 nodes, I look at the first neighbour of each of the 8 nodes, then I look at the second neighbour of the 8 nodes, and so forth. We interleave neighbours. The pseudocode looks as follows:

insert my starting node in a queue
// divide the queue into chunks of up to 8 nodes
for each chunk of nodes x1, x2, ..., x8 in my queue {
    for index i in 1, 2, ..., degree {
        for x in x1, x2, ..., x8 {
            let y be neighbour i of x
            if (y is my target) { // success }
            if (y is new) { add y to next queue }
        }
    }
}
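A sketch of the interleaved version in C, assuming (as in the post's random graph) that every node has the same degree d, with the neighbours of node x stored at edges[x*d .. x*d + d - 1]. The layout and names are illustrative assumptions, not the post's actual code:

```c
#include <stdint.h>
#include <stdlib.h>

#define CHUNK 8

/* Interleaved BFS over a fixed-degree graph stored as one flat array.
   Returns the distance from source to target, or -1 if unreachable. */
int bfs_distance_interleaved(uint32_t n, uint32_t d, const uint32_t *edges,
                             uint32_t source, uint32_t target) {
    if (source == target) return 0;
    uint8_t *seen = calloc(n, 1);
    uint32_t *queue = malloc(n * sizeof(uint32_t));
    uint32_t *next = malloc(n * sizeof(uint32_t));
    size_t qlen = 1, nlen;
    int dist = -1;
    queue[0] = source;
    seen[source] = 1;
    for (int level = 1; qlen > 0 && dist < 0; level++) {
        nlen = 0;
        for (size_t base = 0; base < qlen && dist < 0; base += CHUNK) {
            size_t chunk = (qlen - base < CHUNK) ? qlen - base : CHUNK;
            /* interleave: neighbour i of every node in the chunk, then
               neighbour i+1, ... so the loads are independent of one
               another and their cache misses can overlap */
            for (uint32_t i = 0; i < d && dist < 0; i++) {
                for (size_t k = 0; k < chunk; k++) {
                    uint32_t y = edges[(size_t)queue[base + k] * d + i];
                    if (y == target) { dist = level; break; }
                    if (!seen[y]) { seen[y] = 1; next[nlen++] = y; }
                }
            }
        }
        uint32_t *tmp = queue; queue = next; next = tmp;
        qlen = nlen;
    }
    free(seen); free(queue); free(next);
    return dist;
}
```

Note that the visit order within a level changes, but the set of nodes visited per level (and hence the distances) is the same as in the naive version.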

From a purely computer science perspective, this new pseudocode should not be faster. You will not find it in any textbook I know. Yet it is considerably faster when working over large graphs.

How fast is this new approach? About twice as fast as the naive approach, and faster than the one relying on software prefetches. My source code is available.

naive breadth-first search      | 23 cycles per edge visited
prefetched breadth-first search | 16 cycles per edge visited
new approach                    | 12 cycles per edge visited

I like this example because it illustrates quite well that even if a problem is memory-bound, you can still code a faster solution. There is no need for specialized instructions. There is no need to move anything around in memory. How you access the data matters.

I should point out that this is just something I threw together during a lazy Monday. It is probably nowhere near optimal. It is totally lacking in sophistication. Yet it works. Moreover, it is going to be effective on almost all modern computer architectures.

This is getting more interesting. Which optimization is kicking in here? Is it that the processor sends multiple requests to memory before doing any computation, thereby capturing some kind of locality?

Also, should this work in Java? I quickly wrote some Java code that does a 7-step BFS traversal starting from a random source, either 1-by-1 or with your 8-by-8 trick. The 8-by-8 trick doesn't seem to help on my Mac. I'm using Java 8. I didn't check your code in detail; I just wrote mine based on the pseudocode.

The link got lost for my code: https://github.com/semihsalihoglu-uw/misc/blob/master/java-bfs-opt/ScratchPad.java

Exactly. The idea is that you can replace a software prefetch by a data access that is “ahead of time”. This data access will get stuck, but the processor will work for you, trying to fill it in.

I should warn you that the code above is not optimal. It is a hack to prove the concept.

Yes.

I chose C because it will allow me to do things that are not possible in Java. (I am not nearly done!) But what I just did is language independent.

I’ll look at your Java code. (Next week?)

Your data access is not really “ahead-of-time”, but out-of-order execution kicks in as the accesses in the batch are independent.
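The distinction can be illustrated with a hypothetical micro-example (mine, not from the post): in a pointer chase each load's address depends on the previous load, so the misses are serialized, whereas in a batch all addresses are known up front and the out-of-order engine can issue many misses concurrently.

```c
#include <stddef.h>
#include <stdint.h>

/* Dependent chain: the address of each load comes from the previous
   load, so the processor must wait for each miss before starting the
   next one. */
uint64_t sum_chain(const uint32_t *next, size_t n, uint32_t start) {
    uint64_t sum = 0;
    uint32_t cur = start;
    for (size_t i = 0; i < n; i++) {
        sum += cur;
        cur = next[cur]; /* serializing dependency */
    }
    return sum;
}

/* Independent batch: every address is known before any load completes,
   so out-of-order execution can overlap many misses at once. */
uint64_t sum_batch(const uint32_t *values, const uint32_t *idx, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += values[idx[i]]; /* loads are independent of each other */
    }
    return sum;
}
```

On arrays far larger than cache, the second loop typically runs much faster even though both perform the same number of loads.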

There is prior work doing very similar things with prefetching, since out-of-order execution has its limits (window size). But I'm not aware of anybody doing it for graph traversals.

All of them utilize memory parallelism to reduce or hide memory latency such that the cache miss penalty is less severe.

Group prefetching: https://dl.acm.org/citation.cfm?id=1272747

Asynchronous memory access chaining: http://www.vldb.org/pvldb/vol9/p252-kocberber.pdf

Interleaving with coroutines: http://www.vldb.org/pvldb/vol11/p230-psaropoulos.pdf

I’m not surprised. Algorithms textbooks assume that the difference between a machine with 220 instructions in flight at a time and one with 1 instruction in flight is O(1), which of course it is. However, there is a huge speedup to be had from this sort of transformation if you can break dependencies. I’m working on a Random Forest implementation off and on, and all the first-order issues I’m having with it are identical to what you discuss, although a binary tree is way easier to deal with than an arbitrary graph and more amenable to optimizations. It is interesting that there is way more work on naive parallelism of this sort of thing (“let’s chuck 8 cores at it”) than there is on extracting the parallelism latent in one core. Of course, if you get too good at extracting per-core parallelism, you may wind up saturating uncore resources like L3 or the memory bus. But I have plenty of algorithms whose naive versions really do nothing but sit around and wait for L2 all day, and maxing out L2 utilization is just fine as far as core scaling goes.

The nice thing about an explicit prefetch version is that I, the reader, look at it and immediately know: “This is designed to target a specific memory hierarchy.” The groups-of-8 version certainly can have a comment which describes why it is that way, but in my experience the more likely thing you’ll find is a piece of code that feels more complicated than it needs to be, without any explanation of how or why.

I don’t disagree, but the approach I have taken here is probably both suboptimal and overly complicated. I think you can get both better speed and simpler code with some effort.

I’m thinking prefetch and gather are both potentially interesting optimization areas to look at, even though I have never had a particularly fun time with gather. Both address the problem of running out of places to stash data items fetched in a wide pipeline (in radically different ways).

I’d love to see results with less regular data, when the degree varies.

I’d like to move this to real data, but there is more that could be done on the implementation side first.

VPP has been doing this for 15 years. Well worth a look for some brilliant, performant patterns.

Wiki.fd.io

“Well worth a look for some brilliant performant patterns”

Can you point to some of these patterns?