A lot of data in the real world can be represented as graphs: you have nodes connected through edges. For example, you are a node in a graph where friendships are edges.

I recently met with professor Semih Salihoglu, an expert in graph databases and algorithms. We discussed fun problem like how one can find the shortest path between two nodes in a very large graph.

Semih argued something to the effect that, often, the best you can do is a breadth-first search. That sounds scary and fancy… but it is actually super simple. Start from a given node. This node is at level 0. Then visit all its neighbors (by iterating through its edges). These nodes are at level 1. Then visit all the nodes connected to the nodes at level 1 (excluding your starting node), these are at level 2. And so forth. The magic is that you will end up visiting once (and exactly once) all nodes that can be reached from your starting node.

With this approach, you can find the distance between any two nodes. Just keep exploring, starting from one of the two nodes, until you encounter the other node.

But what happens if the graph is very large? These nodes you are visiting are all over your memory. This means that each node access is a potential cache fault. Our processors are super fast, and they have super fast memory close to them (cache memory), but your main memory (RAM) is comparatively much slower.

Thus, when processing a large graph, you are likely memory-bound… meaning that much of the time is spent waiting for memory access. It is worse than it appears because memory access is a shared resource in multicore processors, which means that you cannot make this problem go away cheaply by buying a processor with many more cores.

Can we quantify this?

I built a large random graph made of 10 million nodes where each node has 16 random neighbors.

I pick a node at random, seek another node far away, and then I measure the time it takes to do the breadth-first search from one to the other. On a per-edge basis, it takes 23 cycles. Don’t worry, things get much worse if the graph gets larger, but let us reflect on the fact that 23 cycles to merely look at the node identifier, check if it has been visited and if not, add it to our list… is a lot. Not counting memory accesses, we should be able to do this work in 5 cycles or less.

Can we do better than 23 cycles?

What if, right before you start processing the neighbors of one node, you told your processor to go fetch the neighbors of the next node? I have a recent post on this very topic: Is software prefetching (__builtin_prefetch) useful for performance?

In that post, I explained that Intel processors have prefetching instructions that the software can call. I also recommended to avoid them.

So what happens if I add a prefetch instruction to my graph code? I go down to 16 cycles… saving a third of the running time.

naive breadth-first search | 23 cycles per edge visited |

prefetched breadth-first search | 16 cycles per edge visited |

My result would seem to invalidate my recommendation to avoid software prefetches. But keep in mind that my implementation is naive and limited, thrown together in a couple of hours. It is a proof of concept. What it demonstrates is that even if you are limited by memory accesses, there are still software choices you can make to help you.

I would only change my recommendation against software prefetches if we definitively could not rewrite the code differently to get the same benefits. I think we can write more clever code.

There are many problems with software prefetches. In some cases, as is the case here, it is better than nothing… But it is still a fragile hack. It helps in my particular case, but change the parameters of the graph, and things might go to hell. Update your processor and things could go to hell. And there is no way to know whether the exact way I did it is close to optimal… it works well in my case, but it might require much tuning in other instances.

So how can we write the code better? I am not certain yet.

**Follow-up**: Greater speed in memory-bound graph algorithms with just straight C code

Daniel Lemire, "Graph algorithms and software prefetching," in *Daniel Lemire's blog*, May 24, 2018.

Thanks Daniel for the post! Quite interesting to see this potential. An immediate thing to see is what happens on a non-random graph. The benefits might go down. The higher the randomness of connections, the more prefetching should help probably, because the cache locality of the regular (non-prefetching) traversal should be the worst when the connections are totally random. I should play around with your code myself.

There are several algorithmic and data structure-related optimizations for shortest path queries to speed up the vanilla BFS-based solution you started with. Most of the algorithmic and data-structure related optimizations are trying to address the same problem though: often batch graph computations are memory-bound. For example, there are smart ways of assigning node-IDS (e.g., according to a hilbert curve), the compressed sparse row format of storing the edges, or partitioning the neighbors of each node so that each partition fits in the lowest-level CPU caches. These optimizations do not change the total number of edges BFS will read but instead try to increase the CPU cache hit rate when reading the edges from the memory. There is also several algorithmic optimizations for the single-pair shortest-paths problem you took, i.e., when the query has a source and a destination. One well-understood one is to do a bidirectional BFS, one from the source and one from the vertex. This one for example directly decreases the number of edges BFSÂ reads.

I see these optimizations in publications and integrated into many graph processing software. However, I don’t think prefetching-like processor-level optimizations are as well studied (nor integrated into systems I study), so work here would be quite interesting. I’m curious which low-level optimizations are available that can enhance other existing optimizations.

Semih

True. But I also expect that with larger graphs, larger gains are possible. Of course, the challenges also increase.

One definitively wants to use real graphs.

I remembered this approach from our chat, but I deliberately went for something naive.

There is probably quite a bit of optimization possible above and beyond purely algorithmic gains. But it is probably not as simple as spraying prefetch instructions in the code (though, if done right, it might be better than nothing).

One has to be very careful when using the word memory bound in a graph context as memory bound has two very different aspects. There is bandwidth bound and latency bound. Graph traversals like BFS are latency bound that is why prefetching helps. On the other hand, page rank is usually bandwidth bound.

I actually doubt that you can rewrite a graph traversal in such a way that current hardware prefetchers can help. They are optimized for sequential and strided accesses. The accesses of graph traversals are too irregular. As already mentioned CSR and node reordering can improve data locality.

Other proposals add a graph prefetcher in hardware.

http://www-dyn.cl.cam.ac.uk/~tmj32/wordpress/hardware-graph-prefetchers/

This is an interesting point. Whether you’d be latency or bandwidth bound, even in BFS, will depend on the implementation, specifically your parallelism. Say you have ten threads (say on a single core machine) running a parallel BFS from a single source to traverse a large graph, I would expect you’ll be bandwidth bound. A single threaded implementation might be less bandwidth bound, so prefetchers here might help more. So, yes, if we were to parallelize Daniel’s code, the benefits of prefetching will likely go down.

… with the caveat stated in my blog post that memory access is a shared resource on multicore systems…

Even the concepts of “latency bound” and “bandwidth bound” are a bit fuzzy when it comes to modern hardware. The DRAM configuration will have a certain maximum bandwidth which is a simple product of the data transfer size and rate (e.g., DDR4-2400) and number of memory channels. Maybe this is 50 GB/s on your system.

An algorithm that on given hardware would otherwise achieve more than 50 GB/s could definitely be called DRAM bandwith limited: you are exploiting the RAM bandwidth to its limits. In principle, a reduction in memory latency wouldn’t help at all (but a RAM bandwidth increase would).

At the other end of the spectrum, you have a classic “pointer chasing” memory latency bound algorithm such as iterating through a linked list (assume the nodes are spread around randomly in memory so prefetching doesn’t let you cheat the latency). This always has an MLP factor of 1 (exactly one outstanding request at any time). The performance of this algorithm is entirely dependent on the latency: if you cut the latency in half, the runtime is cut in half as well.

The latency bound algorithms as above are usually easy to identify statically by examining the data dependency graph: they are the ones where the address for any memory access depends on the result of the prior access (or as a relaxation some prior access). You aren’t really restricted to MLP 1 algorithms either: consider simultaneously iterating from both ends of a doubly-linked list towards the center: this has an MLP of 2, but is still in some sense entirely latency bound: if you halve the latency you again halve the runtime.

We might then intuitively define bandwidth bound algorithms as those were memory accesses aren’t serially dependent at all, i.e., “infinite MLP”. Essentially, that the memory addresses are calculable without involving prior memory accesses (i.e., in the data dependence graph, the memory access nodes are all accessible without passing through any other memory access nodes). A simple example is summing all the elements in an array, or a vector dot-product or whatever: all the addresses to load are calculable without any dependence on earlier loads.

Does the above definition of bandwidth limited algorithms line up with our earlier hardware based definition (an algorithm that is capable of saturating the DRAM interface)? Unfortunately not, at least for single-threaded algorithms on most modern x86 chips (and probably most other high performance chips, but I’m not familiar with the details)!

Modern x86 chips have a limited number of buffers between the core and the memory subsystem. On Intel these are called (Line) Fill Buffers, in the literature they are more generically called MHSR (miss handling status registers). An MLP 1 algorithm will only ever use one of these at a time. An “infinite MLP” algorithm will probably fill all of them. There are a limited number of these buffers. The key observation is that on most chips, even if all of these buffers are filled,

the DRAM bandwidth cannot be reached. Intel chips have 10 such buffers, so on a system with 90 ns memory latency, the maximum achievable bandwidth (ignoring prefetching) is 64 bytes/line * 10 LFBs / (90 ns/line) = ~7.1 GB/s. Yet modern CPUs have DRAM bandwidths of ~20-30 GB/s (dual channel) or 50-60 GB/s (quad channel). So it would take several copies of the above core working in parallel to hit the DRAM bandwidth limit.So I would propose something like “concurrency limited” for algorithms which are limited by the number of outstanding requests at the core level, rather than the memory bandwidth.

One might argue that “concurrency limited” versus “bandwidth limited” is a distinction without a difference, but I think it matters. In particular, it implies that the maximum per-core bandwidth is actually directly dependent on the latency: if you cut the latency in half, your runtime is cut in half even for the apparently “bandwidth limited” algorithms: since the occupancy time of each request in the fill buffers is cut in half and so you get twice as much work done. That’s very different than a truly DRAM bandwidth limited algorithm, where memory latency barely matters.

It also matters because it contradicts advice you’ll often see: that parallelizing “memory bandwidth bound” algorithms by running them on multiple cores doesn’t work since memory bandwidth is a shared resource. Well, the fill buffers are

notshared resources, so concurrency-limited algorithmsdoscale when you add more cores, since you get more fill buffers and hence more parallel requests. Of course this scaling stops when you hit the bandwidth wall: then the parallel version of the algorithm becomes bandwidth limited (the MLP factor for each core stays the same at 10, but the observed latency, aka occupancy time increases so that the DRAM bandwith limit is respected).So I think the most useful way to characterize an algorithm, independently of hardware is to evaluate its MLP factor (maximum theoretical MLP).

Then to apply this to specific hardware, and you determine the HMLP (hardware MLP factor – essentially the number of fill buffers) and then the actual MLP will be the lower of the algorithm MLP and the HMLP. In the case the algorithm MLP is lower than the HMLP we could call the algorithm “latency bound”. In the case that the algorithm MLP is larger than the HMLP, we then also compare the achieved core bandwidth at maximum HMLP (e.g., the ~7.1 GB/s figure calculated above) to the DRAM memory bandwidth figure. If the HMLP-implied bandwidth is lower (as it is on most x86 chips), we could call the algorithm “concurrency limited”, if it is larger we could call the algorithm “RAM bandwidth limited”. Note that this evaluation is hardware dependent: any algorithm with an MLP greater than 1 could fall into any of the three categories, depending on the hardware!

This gives the following intuitive “litmus tests” for the three categorizations, based on the effect of three hypothetical hardware changes (where “Helps!” means a direct proportional effect on runtime):

Decreasing memory latency.

Increasing fill buffer count.

Increasing RAM bandwidth.

Latency BoundHelps!

Does nothing

Does nothing

Concurrency BoundHelps!

Helps!

Does nothing

RAM Bandwidth BoundDoes nothing

Does nothing

Helps!

Of course, in practice you can never change (2) without a uarch change, and you can only limited changes to change (1) or (3) – but this is more a way to think about this stuff than a tuning guide.

Which brings me back to the point I originally wanted to make, but which took a long time to set up the prerequisites! It seems to be that BFS is not likely to be latency limited, unless the average out degree of your graph is very small (close to 1). On most graphs, BFS should be concurrency-limited: as long as the current horizon is at least as large as the HMLP you should get to full concurrency and hence be no more latency-limited than another single core algorithm.

Something like DFS seems more likely to be latency limited, since it is essentially a series of pointer-chasing like serially dependent loads (of course, a lot depends on the graph shape and especially on how the brach predictor ends up working on your graph).

I almost entirely ignored hardware prefetching in the above. I don’t want to cover it fully because this is long enough, but briefly: prefetching complicates the above but doesn’t change the core conclusions. Hardware prefetching can lead to an apparent increase in the number of fill buffers, but they are still limited. You basically then end up with two different types of “concurrency limited” algorithms: prefetch friendly and prefetch unfriendly: you can use the same basic framework to analyze them, but with different HMLP values.

Those “Helps!” lists showed as numbered in the post preview, but not when I actually posted. Anyways, they line up 1,2,3 with the 3 litmus tests noted above (decreasing latency, increasing fill buffers, increasing RAM BW).

How many buffers hardware prefetcher has?

I would agree in general BFS is bounded by latency while PR may be bounded by bandwidth, particularly as compared to BFS. However, it’s often very hard to just simply say like that until it’s mapped to real hardware to see how locality comes into play.

For example [REF], Pagerank on Intel SKX and Intel KNL exhibits very different behavior with a certain size of graph. If you take a look at Fig(3(a) and (b), would you still conclude that PR is memory bound even on KNL?

I think we should be careful even when we say it’s memory bandwidth bounded, like Travis commented below.

[REF] http://heirman.net/papers/sc2018.pdf

Based in my understanding, this is the opposite of what prefetch instructions exist for – CPUs are quite good at predicting sequential and strided accesses already, but aren’t (and in many cases, can’t be) good at predicting… unpredictable fetches. The reason CPUs include prefetch instructions is so that the software developer can give a hint to the CPU that something will be needed shortly, so it should opportunistically start loading it. BFS is exactly one of the data access patterns where CPUs will struggle to predict future loads, and happens to be quite common in garbage collectors.

GC is a similar problem, Effective Prefetch for Mark-Sweep Garbage Collection (http://users.cecs.anu.edu.au/~steveb/downloads/pdf/pf-ismm-2007.pdf)

Thanks for the reference.

There is prior work: Software Prefetching for Mark-Sweep Garbage Collection (2004).