Gender and peer review

Modern science works in the following manner. You do the research. You write a paper. You publish the paper. For historical reasons, “publishing the paper” typically means “submit it to a committee of your peers and get their seal of approval”.

We can rightfully be concerned about the unwanted biases that this committee of peers might have. For example, maybe these peers have unconscious biases against women?

It is not unreasonable. We have evidence that peer review is strongly biased in favour of authors from prestigious institutions. For example, Peter and Ceci took already accepted articles from authors at prestigious institutions, and they resubmitted them as authors from lesser institutions: they encountered an 89% rejection rate.

That, by itself, is not as worrying as it seems. It is a fact that Stanford researchers are better, on average, than researchers from that school you never heard from. Everything else being equal, it makes sense to put more trust in work from Stanford.

But gender is something else. We have no rational reason to trust the work done by men more than the work done by women.

So is peer review sexist? Webb et al. in Does double-blind review benefit female authors? write:

We found a significant interaction between gender and time (P < 0.0001), reflecting the higher female authorship post-2001 than pre-2001, but there was no significant interaction between gender and review type (...) Budden and colleagues call for the ecological and evolutionary community to revisit the issue of peer review in light of their study of BE, and their claim that double-blind review benefits female authors (...) This is despite the fact that the only evidence they supply is an increase in female first authorship in a single journal (an increase also seen in other journals that did not switch to double-blind review) rather than anything more compelling, such as a demonstration that the ratio of accepted to submitted manuscripts had increased for females relative to males after the introduction of double-blind review.

Ceci and Williams review the evidence regarding biases against women

The preponderance of evidence, including the best and largest studies, indicates no discrimination in reviewing women’s manuscripts

Sugimoto et al. find that

(…) recent meta-analysis suggests that claims of gender bias in peer review “are no longer valid”. For example, if there is gender bias in review, we would expect double-blind conditions to increase acceptance rates for female authors. However, this is not the case. Nor are manuscripts by female authors disproportionately rejected at single-blind review journals. Even when the quality of submissions is controlled for, manuscripts authored by women do not appear to be rejected at a higher rate than those authored by men. Meta-analyses and large-scale studies of grant outcomes found no gender differences after adjusting for factors such as discipline, country, institution, experience, and past research output.

So it seems that we have good news. Peer review is not, statistically speaking, sexist.

Graph algorithms and software prefetching

A lot of data in the real world can be represented as graphs: you have nodes connected through vertices. For example, you are a node in a graph where friendships are vertices.

I recently met with professor Semih Salihoglu, an expert in graph databases and algorithms. We discussed fun problem like how one can find the shortest path between two nodes in a very large graph.

Semih argued something to the effect that, often, the best you can do is a breadth-first search. That sounds scary and fancy… but it is actually super simple. Start from a given node. This node is at level 0. Then visit all its neighbors (by iterating through its vertices). These nodes are at level 1. Then visit all the nodes connected to the nodes at level 1 (excluding your starting node), these are at level 2. And so forth. The magic is that you will end up visiting once (and exactly once) all nodes that can be reached from your starting node.

With this approach, you can find the distance between any two nodes. Just keep exploring, starting from one of the two nodes, until you encounter the other node.

But what happens if the graph is very large? These nodes you are visiting are all over your memory. This means that each node access is a potential cache fault. Our processors are super fast, and they have super fast memory close to them (cache memory), but your main memory (RAM) is comparatively much slower.

Thus, when processing a large graph, you are likely memory-bound… meaning that much of the time is spent waiting for memory access. It is worse than it appears because memory access is a shared resource in multicore processors, which means that you cannot make this problem go away cheaply by buying a processor with many more cores.

Can we quantify this?

I built a large random graph made of 10 million nodes where each node has 16 random neighbors.

I pick a node at random, seek another node far away, and then I measure the time it takes to do the breadth-first search from one to the other. On a per-vertex basis, it takes 15 cycles. Don’t worry, things get much worse if the graph gets larger, but let us reflect on the fact that 15 cycles to merely look at the node identifier, check if it has been visited and if not, add it to our list… is a lot. Not counting memory accesses, we should be able to do this work in 5 cycles or less.

Can we do better than 15 cycles?

What if, right before you start processing the neighbors of one node, you told your processor to go fetch the neighbors of the next node? I have a recent post on this very topic: Is software prefetching (__builtin_prefetch) useful for performance?

In that post, I explained that Intel processors have prefetching instructions that the software can call. I also recommended to avoid them.

So what happens if I add a prefetch instruction to my graph code? I go down to 9 cycles… saving a third of the running time.

naive breadth-first search15 cycles per vertex visited
prefetched breadth-first search9 cycles per vertex visited

My code is available (in C).

My result would seem to invalidate my recommendation to avoid software prefetches. But keep in mind that my implementation is naive and limited, thrown together in a couple of hours. It is a proof of concept. What it demonstrates is that even if you are limited by memory accesses, there are still software choices you can make to help you.

I would only change my recommendation against software prefetches if we definitively could not rewrite the code differently to get the same benefits. I think we can write more clever code.

There are many problems with software prefetches. In some cases, as is the case here, it is better than nothing… But it is still a fragile hack. It helps in my particular case, but change the parameters of the graph, and things might go to hell. Update your processor and things could go to hell. And there is no way to know whether the exact way I did it is close to optimal… it works well in my case, but it might require much tuning in other instances.

So how can we write the code better? I am not certain yet.

Science and Technology links (May 18th, 2018)

  1. How is memory encoded in your brain? If you are like me, you assume that it is encoded in the manner in which your brain cells are connected together. Strong and weak connections between brain cells create memories. Some people think that it is not how memories are encoded.

    To prove that it is otherwise, scientists have recently transferred memories between snails by injections of small molecules taken from a trained snail. Maybe one day you could receive new memories through an injection. If true, this result is probably worth a Nobel prize. It is probably not true.

  2. Inflammation is a crude and undiscerning immune response that your body uses when it has nothing better to offer. One of the reasons aspirin is so useful is that it tends to reduce inflammation. There are many autoimmune diseases that can be described as “uncontrolled inflammation”. For example, many people suffer from psoriasis: their skin peels off and becomes sensitive. Richards et al. believe that most neurodegenerative diseases (such as Alzheimer’s, Parkinson’s, ALS) are of a similar nature:

    it is now clear that inflammation is (…) likely, a principal underlying cause in most if not all neurodegenerative diseases

  3. Scientists are sounding the alarm about the genetic tinkering carried out in garages and living rooms.
  4. The more intelligent you are, the less connected your brain cells are:

    (…)individuals with higher intelligence are more likely to have larger gray matter volume (…) intelligent individuals, despite their larger brains, tend to exhibit lower rates of brain activity during reasoning (…) higher intelligence in healthy individuals is related to lower values of dendritic density and arborization. These results suggest that the neuronal circuitry associated with higher intelligence is organized in a sparse and efficient manner, fostering more directed information processing and less cortical activity during reasoning.

  5. It is known that alcohol consumption has a protective effect on your heart. What about people who drink too much? A recent study found that patients with a troublesome alcohol history have a significantly lower prevalence of cardiovascular disease events, even after adjusting for demographic and traditional risk factors. Please note that it does not imply that drinking alcohol will result in a healthier or longer life.
  6. A third of us have high blood pressure. And most of us are not treated for it.
  7. Eating lots of eggs every day is safe. Don’t be scared of their cholesterol. (Credit: Mangan.)
  8. According to data collected by NASA, global temperatures have fallen for the last two years. This is probably due to the El Nino effect that caused record temperatures two years ago. What is interesting to me is that these low global temperatures get no mention at all in the press whereas a single high temperature record (like what happened two years ago) gets the front page.

    That’s a problem in my opinion. You might think that by pushing aside data that could be misinterpreted, you are protecting the public. I don’t think it works that way. People are less stupid and more organized than you might think. They will find the data, they will talk about themselves, and they will lose confidence in you (rightly so). The press and the governments should report that the temperatures are decreasing… and then explain why it does not mean that the Earth is not warming anymore.

    The Earth is definitively getting warmer, at a rate of about 0.15 degrees per decade. You best bet is to report the facts:

  9. Low-carbohydrate, high-fat diets might prevent cancer progression.
  10. Participating in the recommended 150 minutes of moderate to vigorous activity each week, such as brisk walking or biking, in middle age may be enough to reduce your heart failure risk by 31 percent. (There is no strong evidence currently that people who exercise live longer. It does seem that they are more fit, however.)

Validating UTF-8 strings using as little as 0.7 cycles per byte

Most strings found on the Internet are encoded using a particular unicode format called UTF-8. However, not all strings of bytes are valid UTF-8. The rules as to what constitute a valid UTF-8 string are somewhat arcane. Yet it seems important to quickly validate these strings before you consume them.

In a previous post, I pointed out that it takes about 8 cycles per byte to validate them using a fast finite-state machine. After hacking code found online, I showed that using SIMD instructions, we could bring this down to about 3 cycles per input byte.

Is that the best one can do? Not even close.

Many strings are just ASCII, which is a subset of UTF-8. They are easily recognized because they use just 7 bits per byte, the remaining bit is set to zero. Yet if you check each and every byte with silly scalar code, it is going to take over a cycle per byte to verify that a string is ASCII. For much better speed, you can vectorize the problem in this manner:

__m128i mask = _mm_setzero_si128();
for (...) {
    __m128i current_bytes = _mm_loadu_si128(src);
    mask = _mm_or_si128(mask, current_bytes);
__m128i has_error =  _mm_cmpgt_epi8(
         _mm_setzero_si128(), mask);
return _mm_testz_si128(has_error, has_error);

Essentially, we are loading up vector registers, computing a logical OR with a running mask. Whenever a character outside the allowed range is present, then the last bit will be set in the running mask. We continue until the very end no matter what, and only then do we examine the mask.

We can use the same general idea to validate UTF-8 strings. My code is available.

finite-state machine (is UTF-8?)8 to 8.5 cycles per input byte
determining if the string is ASCII0.07 to 0.1 cycles per input byte
vectorized code (is UTF-8?)0.7 to 0.9 cycles per input byte

If you are almost certain that most of your strings are ASCII, then it makes sense to first test whether the string is ASCII, and only then fall back on the more expensive UTF-8 test.

So we are ten times faster than a reasonable scalar implementation. I doubt this scalar implementation is as fast as it can be… but it is not naive… And my own code is not nearly optimal. It is not using AVX to say nothing of AVX-512. Furthermore, it was written in a few hours. I would not be surprised if one could double the speed using clever optimizations.

The exact results will depend on your machine and its configuration. But you can try the code.

I have created a C library out of this little project as it seems useful. Contributions are invited. For reliable results, please configure your server accordingly: benchmarking on a laptop is hazardous.

Credit: Kendall Willets made a key contribution by showing that you could “roll” counters using saturated subtractions.

Is research sick?

One of the most important database researchers of all time, Michael Stonebraker, has given a talk recently on the state of database research. I believe that many of his points are of general interest:

  • We have a lost our consumers… Researchers write for other researchers. They are being insular and ultimately irrelevant.
  • We ignore the important problems… Researchers prefer to focus on the easy problems they know how to solve right away.
  • Irrelevant theory comes to dominate… Making things work in the real work is harder than publishing yet-another-theory paper.
  • We do not have research taste… Researchers jump on whatever is fashionable with little regard to whether it makes sense (XML, Semantic Web, MapReduce, Object databases and so forth).

The core message that Stonebraker is trying to pass is that the seminal work he did years ago would no longer be possible today.

I do not share his views, but I think that they are worth discussing.

The fact that we are publishing many papers is not, by itself, a problem. The irrelevant theory can be ignored. The fact that many people follow trends mindlessly whereas only a few make their own ways, is just human nature.

So what is the real problem?

My take is that there is an underlying supply-and-demand issue. We are training many more PhDs than there are desirable tenure-track jobs. This puts early-career researchers in a overcompetitive game. If you can signal your value by getting papers accepted at competitive venues, then that is what you are going to do. If you need to temporarily exagerate your accomplishments, or find shortcuts, maybe you will do so to secure the much needed job.

You cannot magically make a field less competitive without tackling the supply and demand. Either you reduce the number of PhDs, or you increase the number of desirable research jobs. I’d vote for the latter. Stonebraker says nothing about this. In fact, though I like him well enough, he is an elitist who cares for little but the top-20 schools in the US. The problem is that even just these top-20 schools produce many more PhDs than they consume. Is it any wonder if all these PhD students desperately play games?

The reason Stonebraker was able to get a tenure-track job in a good school when he completed his PhD without any publication is… Economics 101… there was no shortage of tenure-track jobs back then.

Is it any wonder if the new professors recruited in an overcompetitive system are status-obsessed folks?

There is no doubt that some PhD students and young faculty members will pursue volume at the expense of quality and relevance. However, there is no reason to believe that the best American universities (the ones Stonebraker cares about) do not already take this into account. Do you really think you can go from a PhD program to a faculty job at Harvard, Stanford or MIT without one or two great papers? Do you really think that these schools will pass on someone who has produced seminal work, and pick the fellow who has a large volume of irrelevant papers instead? If that’s the case, where are the examples?

Meanwhile, there are more great papers being published in my area, in any given year, than I can read. Here is a hint: you can find many of them on arXiv. No need to go to an expensive conference at some remote location. It might be the case that some disciplines are corrupted, but computer science research feels quite worthwhile, despite all its flaws.

Science and Technology links (May 11th, 2018)

  1. It looks like avoiding food most of the day, even if you do not eat less, is enough to partially rejuvenate you.
  2. Google researchers use deep learning to emulate how mammals find their way in challenging environments.
  3. We know that athletes live longer than the rest of us. It turns out that Chess players also live longer.
  4. Google claims to have new pods of TPUs (tensor processor units) reaching 100 petaflops. I don’t know if this number makes sense. It seems higher than many supercomputers. Some estimates indicate that our brains have about 1 exaflop, so if you put 10 pods together, you should have enough computational power to match the human brain. If Google has this technology, I’d consider it to be a massive step forward.
  5. Gwern Branwen has a fantastic essay on the genetics of the novel Dune. In the novel, thousands of years are spent improving human genetics. Gwern suggests that this is wrong:

    even the most pessimistic estimates of how many generations it might take to drastically increase average human intelligence (…) might be 20 generations, certainly not thousands of generations

    The point seems to be that there is no need to wait for the emergence of new genes. Instead, we just have to select the right alleles in many genes which is far easier.

  6. In a clinical trial, weight loss cured most people of type 2 diabetes.
  7. There is currently no cure for male pattern baldness. A drug called WAY-316606 has been found to cure baldness… in mice. I could use more hair.

How quickly can you check that a string is valid unicode (UTF-8)?

Though character strings are represented as bytes (values in [0,255]), not all sequences of bytes are valid strings. By far the most popular character encoding today is UTF-8, part of the unicode standard. How quickly can we check whether a sequence of bytes is valid UTF-8?

Any ASCII string is a valid UTF-8 string. An ASCII character is simply a byte value in [0,127] or [0x00, 0x7F] in hexadecimal. That is, the most significant bit is always zero.

You can check that a string is made of ASCII characters easily in C:

bool is_ascii(const signed char *c, size_t len) {
  for (size_t i = 0; i < len; i++) {
    if(c[i] < 0) return false;
  return true;

However, there are many more unicode characters than can be represented using a single byte. For other characters, outside the ASCII set, we need to use two or more bytes. All of these “fancier” characters are made of sequences of bytes all having the most significant bit set to 1. However, there are somewhat esoteric restrictions:

  • All of the two-byte characters are made of a byte in [0xC2,0xDF] followed by a byte in [0x80,0xBF].
  • There are four types of characters made of three bytes. For example, if the first by is 0xE0, then the next byte must be in [0xA0,0xBF] followed by a byte in [0x80,0xBF].
    • It is all quite boring but can be summarized by the following table:

      First ByteSecond ByteThird ByteFourth Byte

      So, how quickly can we check whether a string satisfies these conditions?

      I went looking for handy C/C++ code. I did not want to use a framework or a command-line tool.

      The first thing I found is Björn Höhrmann’s finite-state machine. It looks quite fast. Without getting in the details, given a small table that includes character classes and state transitions, the gist of Höhrmann’s code consists in repeatedly calling this small function:

      bool is_ok(uint32_t* state, uint32_t byte) {
        uint32_t type = utf8d[byte];
        *state = utf8d[256 + *state * 16 + type];
        return (*state != 1); // true on error 

      Then I went looking for a fancier, vectorized, solution. That is, I want a version that uses advanced vector registers.

      I found something sensible by Olivier Goffart. Goffart’s original code translates UTF-8 into UTF-16 which is more than I want done. So I modified his code slightly, mostly by removing the unneeded part. His code will only run on x64 processors.

      To test these functions, I wanted to generate quickly some random strings, but to measure accurately the string, I need it to be valid UTF-8. So I simply generated ASCII strings. This makes the problem easier, so I probably underestimate the difficulty of the problem. This problem is obviously dependent on the data type and lots of interesting inputs are mostly just ASCII anyhow.

      Olivier Goffart’s code “cheats” and short-circuit the processing when detecting ASCII code. That’s fine, but I created two versions of his function, one with and one without the “cheat”.

      So, how quickly can these functions check that strings are valid UTF-8?

      string sizeis ASCII?Höhrmann’s finite-state machine Goffart’s (with ASCII cheat) Goffart’s (no ASCII cheat)
      32~2.5 cycles per byte~8 cycles per byte~5 cycles per byte~6 cycles per byte
      80~2 cycles per byte~8 cycles per byte~1.7 cycles per byte~4 cycles per byte
      512~1.5 cycles per byte~8 cycles per byte~0.7 cycles per byte~3 cycles per byte

      My source code is available.

      The vectorized code gives us a nice boost… Sadly, in many applications, a lot of the strings can be quite small. In such cases, it seems that we need to spend something close to 8 cycles per byte just to check that the string is valid?

      In many cases, you could short-circuit the check and just verify that the string is an ASCII string, but it is still not cheap, at about 2 cycles per input byte.

      I would not consider any of the code that I have used to be “highly optimized” so it is likely that we can do much better. How much better remains an open question to me.

      Update: Daniel Dunbar wrote on Twitter

      I would expect that in practice best version would be highly optimized ascii only check for short segments, with fallback to full check if any in the segment fail

      That is close to Goffart’s approach.

Science and Technology links (May 5th, 2018)

  1. Oculus, a subsidiary of Facebook, has released its $200 VR headset (the Oculus Go). You can order it on Amazon. The reviews are good. It is standalone and wireless which is excellent. The higher-quality Oculus Rift and its nifty controllers are down to only $400, with the caveat that it needs a cable back to a powerful PC.

    Some analysts are predicting massive growth for virtual reality headsets this year: IDC anticipates sales to reach 12.4 million units.
    To me, the killer application has to be conferencing in virtual reality. Who wouldn’t like to go chat with a friend in some virtual world? But it is unclear whether we have the hardware to do it: you’d need good eye tracking as it is needed if you are going to be able to look at people in the eyes.

  2. Consuming glucose (sugar) impairs memory (in rats). The theory is that glucose reduces neurogenesis: you are making fewer new neurons when eating sugar.
  3. At age 20, the life expectancy was another 47 years (age 67/68) in 1930. This means that when the retirement age of 65 was enacted in the US and other countries, we expected people to have reached the very end of their life by then.
  4. Some coffee drinkers, me included, report side-effects when abstaining from coffee. Yet the belief that one has ingested caffeine is sufficient to reduce caffeine withdrawal symptoms and cravings.
  5. A private charity, the Charity Open Access Fund, pays publishers to ensure that research papers are available freely (under open access). Oxford University Press was apparently happy to take the money, but failed to deliver: a total of 34 per cent of articles paid for by the Charity Open Access Fund in 2016-17 were not made available as promised. The Oxford University Press is treated as a tax-exempt non-profit.
  6. It is believed that part of the biological aging process can be attributed to a reduction in nicotinamide adenine dinucleotide (NAD+) in our cells. A new paper in Cell Metabolism shows that this reduction can be reversed with some drug called 78c. It improves muscle function, exercise capacity, and cardiac function in mice.
  7. In Singapore, the number of people aged 65 and above is projected to double to 900,000 or 25% of the population by 2030. In Japan, 27% of the population was 65 years or older in 2016, it will be over 33% by 2030.

How fast can you parse JSON?

JSON has become the de facto standard exchange format on the web today. A JSON document is quite simple and is akin to a simplified form of JavaScript:

     "Image": {
        "Width":  800,
        "Height": 600,
        "Animated" : false,
        "IDs": [116, 943, 234, 38793]

These documents need to be generated and parsed on a large scale. Thankfully, we have many fast libraries to parse and manipulate JSON documents.

In a recent paper by Microsoft (Mison: A Fast JSON Parser for Data Analytics), the researchers report parsing JSON document at 0.1 or 0.2 GB/s with common libraries such as RapidJSON. It is hard to tell the exact number as you need to read a tiny plot, but I have the right ballpark. They use a 3.5 GHz processor, so that’s 8 to 16 cycles per input byte of data.

Does it make sense?

I don’t have much experience processing lots of JSON, but I can use a library. RapidJSON is handy enough. If you have a JSON document in a memory buffer, all you need are a few lines:

rapidjson::Document d;
if(! d.ParseInsitu(buffer).HasParseError()) {
 // you are done parsing

This “ParseInsitu” approach modifies the input buffer (for faster handling of the strings), but is fastest. If you have a buffer that you do not want to modify, you can call “Parse” instead.

To run an example, I am parsing one sizeable “twitter.json” test document. I am using a Linux server with a Skylake processor. I parse the document 10 times and check that the minimum and the average timings are close.

ParseInsitu Parse
4.7 cycles / byte 7.1 cycles / byte

This is the time needed to parse the whole document into a model. You can get even better performance if you use the streaming API that RapidJSON provides.

Though I admit that my numbers are preliminary and partial, they suggest to me that Microsoft researchers might not have given RapidJSON all its chances, since their numbers are closer to the “Parse” function which is slower. It is possible that they do not consider it acceptable that the input buffer is modified but I cannot find any documentation to this effect, nor any related rationale. Given that they did not provide their code, it is hard to tell what they did exactly with RapidJSON.

The Microsoft researchers report results roughly 10x better than RapidJSON, equivalent to a fraction of a cycle per input byte. The caveat is that they only selectively parse the document, extracting only subcomponents of the document. As far as I can tell, their software is not freely available.

How would they fare against an optimized application of the RapidJSON library? I am not sure. At a glance, it does not seem implausible that they might have underestimated the speed of RapidJSON by a factor of two.

In their paper, the Java-based JSON libraries (GSON and Jackson) are fast, often faster than RapidJSON even if RapidJSON is written in C++. Is that fair? I am not, in principle, surprised that Java can be faster than C++. And I am not very familiar with RapidJSON… but it looks like performance-oriented C++. C++ is not always faster than Java but in the hands of the right people, I expect it to do well.

So I went looking for a credible performance benchmark that includes both C++ and Java JSON libraries and found nothing. Google is failing me.

In any case, to answer my own question, it seems that parsing JSON should take about 8 cycles per input byte on a recent Intel processor. Maybe less if you are clever. So you should expect to spend 2 or 3 seconds parsing one gigabyte of JSON data.

I make my code available.

Is software prefetching (__builtin_prefetch) useful for performance?

Many software performance problems have to do with data access. You could have the most powerful processor in the world, if the data is not available at the right time, the computation will be delayed.

It is intuitive. We used to locate libraries close to universities. In fact, universities were built around libraries. We also aggregate smart and knowledgeable people together.

Sadly, our computers are organized with processors on one side and memory on the other. To make matters worse, memory chips are simply unable to keep up with the processing power of processors. Thus we introduce caches. These are fast memory stores that are stuck to the processor. We have a whole hierarchy of CPU caches.

Happily, processor vendors like Intel have invested a lot of effort in making sure that the right data is made available at the right time. Thus if you access data according to a predictable pattern the processor will do a good job. Also, the processor can reorder its instructions and execute a load instruction earlier if it is useful. Simply put, a lot of engineering went to make sure that data-access problems are minimized. You might think that, as a programmer, you can do better… but truth is, most times you couldn’t. At least, most of us, most of the time, couldn’t.

You can still “blind” your processor. Some expensive instructions like the division will apparently prevent processors from fetching the data in time. At the very least, they can prevent some instruction reordering. Intuitively, the processor is so busy processing a monstrous instruction that it has no time to see what is coming next.

You can also thoroughly confuse the processor by doing many different things in an interleaved manner.

In a comment to one of my earlier blog post, someone suggested that I could improve the performance of my code by using software prefetches. That is, processor vendors like Intel provide us with instructions that can serve as hints to the processor… effectively telling the processor ahead of time that some data will be needed. Using the GCC and Clang compiler, you can invoke the __builtin_prefetch intrinsic function to get this effect. I do not know if other compilers, like that of Visual Studio, have this function.

The instructions generated by this function have a cost of their own, so they can definitively make your software slower. Still, when inserted in just the right manner in the right code, they can improve the performance.

Should you be using __builtin_prefetch? My experience says that you probably shouldn’t. I don’t like hard rules, but I have not yet seen a scenario where they are obviously a good idea. That is, I claim that if software prefetching is useful, that’s probably because your code could use some work.

I asked the commenter to provide an example where software prefetching was a good idea. He offered a scenario where we sum the values of bytes within an array…

for (int j = 0; j < ...; j++) {
    counter += bytes[p];
    p = (p + 513) & (NUMBYTES - 1);

This code, by itself, will not benefit from software prefetching. However, the commenter interleaved the sums…

for (int j = 0; j < ...; j++) {
  for(int i = 0; i < batchsize; i++) {
    counter[i] += bytes[p[i]];
    p[i] = (p[i] + 513) & (NUMBYTES - 1);

If you pick batchsize large enough, you can confuse the processor and get some benefits from using the __builtin_prefetch function. I suppose that it is because the access pattern looks too unpredictable to the processor.

Let us look at the numbers I get…

sequential sums3 s
interleaved sums4.4 s
interleaved sums with __builtin_prefetch4.0 s

The prefetching improves the performance of the interleaved sums by 10%, but you can get much better performance simply by doing the sums one by one. The resulting code will be simpler, easier to debug, more modular and faster.

To provide evidence that __builtin_prefetch is useful, you need to take code, optimize it as much as you can… and then, only then, show that adding __builtin_prefetch improves the performance.

My source code is available.

Credit: I’d like to thank Simon Hardy-Francis for his code contribution.