A lot of data in the real world can be represented as graphs: you have nodes connected through edges. For example, you are a node in a graph where friendships are edges.

I recently met with professor Semih Salihoglu, an expert in graph databases and algorithms. We discussed fun problem like how one can find the shortest path between two nodes in a very large graph.

Semih argued something to the effect that, often, the best you can do is a breadth-first search. That sounds scary and fancy… but it is actually super simple. Start from a given node. This node is at level 0. Then visit all its neighbors (by iterating through its edges). These nodes are at level 1. Then visit all the nodes connected to the nodes at level 1 (excluding your starting node), these are at level 2. And so forth. The magic is that you will end up visiting once (and exactly once) all nodes that can be reached from your starting node.

With this approach, you can find the distance between any two nodes. Just keep exploring, starting from one of the two nodes, until you encounter the other node.

But what happens if the graph is very large? These nodes you are visiting are all over your memory. This means that each node access is a potential cache fault. Our processors are super fast, and they have super fast memory close to them (cache memory), but your main memory (RAM) is comparatively much slower.

Thus, when processing a large graph, you are likely memory-bound… meaning that much of the time is spent waiting for memory access. It is worse than it appears because memory access is a shared resource in multicore processors, which means that you cannot make this problem go away cheaply by buying a processor with many more cores.

Can we quantify this?

I built a large random graph made of 10 million nodes where each node has 16 random neighbors.

I pick a node at random, seek another node far away, and then I measure the time it takes to do the breadth-first search from one to the other. On a per-edge basis, it takes 23 cycles. Don’t worry, things get much worse if the graph gets larger, but let us reflect on the fact that 23 cycles to merely look at the node identifier, check if it has been visited and if not, add it to our list… is a lot. Not counting memory accesses, we should be able to do this work in 5 cycles or less.

Can we do better than 23 cycles?

What if, right before you start processing the neighbors of one node, you told your processor to go fetch the neighbors of the next node? I have a recent post on this very topic: Is software prefetching (__builtin_prefetch) useful for performance?

In that post, I explained that Intel processors have prefetching instructions that the software can call. I also recommended to avoid them.

So what happens if I add a prefetch instruction to my graph code? I go down to 16 cycles… saving a third of the running time.

naive breadth-first search | 23 cycles per edge visited |

prefetched breadth-first search | 16 cycles per edge visited |

My result would seem to invalidate my recommendation to avoid software prefetches. But keep in mind that my implementation is naive and limited, thrown together in a couple of hours. It is a proof of concept. What it demonstrates is that even if you are limited by memory accesses, there are still software choices you can make to help you.

I would only change my recommendation against software prefetches if we definitively could not rewrite the code differently to get the same benefits. I think we can write more clever code.

There are many problems with software prefetches. In some cases, as is the case here, it is better than nothing… But it is still a fragile hack. It helps in my particular case, but change the parameters of the graph, and things might go to hell. Update your processor and things could go to hell. And there is no way to know whether the exact way I did it is close to optimal… it works well in my case, but it might require much tuning in other instances.

So how can we write the code better? I am not certain yet.

**Follow-up**: Greater speed in memory-bound graph algorithms with just straight C code

Thanks Daniel for the post! Quite interesting to see this potential. An immediate thing to see is what happens on a non-random graph. The benefits might go down. The higher the randomness of connections, the more prefetching should help probably, because the cache locality of the regular (non-prefetching) traversal should be the worst when the connections are totally random. I should play around with your code myself.

There are several algorithmic and data structure-related optimizations for shortest path queries to speed up the vanilla BFS-based solution you started with. Most of the algorithmic and data-structure related optimizations are trying to address the same problem though: often batch graph computations are memory-bound. For example, there are smart ways of assigning node-IDS (e.g., according to a hilbert curve), the compressed sparse row format of storing the edges, or partitioning the neighbors of each node so that each partition fits in the lowest-level CPU caches. These optimizations do not change the total number of edges BFS will read but instead try to increase the CPU cache hit rate when reading the edges from the memory. There is also several algorithmic optimizations for the single-pair shortest-paths problem you took, i.e., when the query has a source and a destination. One well-understood one is to do a bidirectional BFS, one from the source and one from the vertex. This one for example directly decreases the number of edges BFS reads.

I see these optimizations in publications and integrated into many graph processing software. However, I don’t think prefetching-like processor-level optimizations are as well studied (nor integrated into systems I study), so work here would be quite interesting. I’m curious which low-level optimizations are available that can enhance other existing optimizations.

Semih

True. But I also expect that with larger graphs, larger gains are possible. Of course, the challenges also increase.

One definitively wants to use real graphs.

I remembered this approach from our chat, but I deliberately went for something naive.

There is probably quite a bit of optimization possible above and beyond purely algorithmic gains. But it is probably not as simple as spraying prefetch instructions in the code (though, if done right, it might be better than nothing).

One has to be very careful when using the word memory bound in a graph context as memory bound has two very different aspects. There is bandwidth bound and latency bound. Graph traversals like BFS are latency bound that is why prefetching helps. On the other hand, page rank is usually bandwidth bound.

I actually doubt that you can rewrite a graph traversal in such a way that current hardware prefetchers can help. They are optimized for sequential and strided accesses. The accesses of graph traversals are too irregular. As already mentioned CSR and node reordering can improve data locality.

Other proposals add a graph prefetcher in hardware.

http://www-dyn.cl.cam.ac.uk/~tmj32/wordpress/hardware-graph-prefetchers/

This is an interesting point. Whether you’d be latency or bandwidth bound, even in BFS, will depend on the implementation, specifically your parallelism. Say you have ten threads (say on a single core machine) running a parallel BFS from a single source to traverse a large graph, I would expect you’ll be bandwidth bound. A single threaded implementation might be less bandwidth bound, so prefetchers here might help more. So, yes, if we were to parallelize Daniel’s code, the benefits of prefetching will likely go down.

… with the caveat stated in my blog post that memory access is a shared resource on multicore systems…

GC is a similar problem, Effective Prefetch for Mark-Sweep Garbage Collection (http://users.cecs.anu.edu.au/~steveb/downloads/pdf/pf-ismm-2007.pdf)

Thanks for the reference.

There is prior work: Software Prefetching for Mark-Sweep Garbage Collection (2004).