Learning from the object-oriented mania

Back when I started programming professionally, every expert and every software engineering professor would swear by object-oriented programming. Resistance was futile. History had spoken: the future was object-oriented.

It is hard to overstate how strong the mania was. In education, we started calling textbooks and videos ‘learning objects’. Educators would soon ‘combine learning objects and reuse them’.

A competitor of a client I was working with at the time had written a server in C. They had to pay lip service to object-oriented programming, so they said that their code was ‘object-oriented’.

I once led a project to build an image compression system. We were required, before we even wrote a single line of code, to plan it all out using ‘UML’. It had to be object-oriented from the start, you see.

You had to know your object-oriented design patterns, or you could not be taken seriously.

People rewrote their database engines so that they would be object-oriented.

More than 25 years later, we can finally say, without needing much courage, that it was insane, outrageous, and terribly wasteful. 

Yet, even today, the pressure remains. Students are compelled to write simple projects using multiple classes. It is fair enough to teach the principles of object-oriented programming, but we still demand that students embrace the ideology.

To be fair, some of the basic principles behind object-oriented programming can be useful. At least, you should know about them.

But the mania was unwarranted and harmful.

The lesson you should draw is not that object-oriented programming is bad, but rather that whatever technique or idea is currently trendy is likely grossly overrated.

The social mechanism is constantly in action, though it is no longer acting for object-oriented programming. It takes many forms. Not long ago, you had to wear a mask to attend a conference. Everyone ‘knew’ that masks stopped viruses and had no side-effects… just as everyone knew that object-oriented programming makes better and more maintainable software, without negative side-effects. All experts agree. All figures of authority agree. The written press agrees. The celebrities agree. The social pressure to conform is everywhere. It must be true, it has to be true. Anyone disagreeing is a bad person.

You can recognize such a social contagion by its telltale signs.

  1. Rapid Spread: A social contagion spreads quickly through a group or community, much like a wildfire. One day everyone is talking about the latest object-oriented pattern, and the next day, everyone is putting it into practice.
  2. Amplification: You often observe the emergence of ‘influencers’, people who gain high social status and use their newfound position to push the hype further. The object-oriented mania was driven by many key players who made a fortune in the process. They appeared in popular shows, magazines, and so forth.
  3. Peer Influence: Social contagion often relies on peer influence. E.g., everyone around you starts talking about object-oriented programming.
  4. Conformity: People often mimic the behaviors or attitudes of others in their group, leading to a conformity effect. People who do not conform are often excluded, either explicitly or implicitly. For example, object-oriented programming started to appear in job ads and was promoted by government agencies.
  5. Aggressive Behavior: You see a significant change from usual behavior as irrationality creeps in. If you criticize object-oriented programming, something is wrong with you!
  6. Grandiose Beliefs or Delusions: Claims that object-oriented programming would forever change the software industry for the better were everywhere. You could supposedly reuse your objects and classes easily from one project to the next. Never mind that none of these claims could ever be sustained.
  7. Risky Behavior: Entire businesses bet their capital on projects trying to reinvent some established tool in an object-oriented manner. People kept throwing caution to the wind: let us rebuild everything the one true way, what is the worst that can happen?

Forwarding references in C++

In C++, there are different ways to pass a value to a function. Typically, at any given time, an object in C++ ‘belongs’ to a single function. The various ways to call a function differ in who owns the object, the caller or the callee (the function being called).

The simplest approach is to pass by value. In such cases, a copy is typically made of the object and both the caller and the callee own a copy.

void value(MyClass obj) {}

We can pass by reference. You recognize a reference by the single ampersand (&). The caller owns the object, but the callee gets access to it.

void reference(MyClass& obj) {}

You can also pass by an “rvalue reference”, which you recognize by the ‘&&’ symbols. In such cases, while the caller initially creates the object, its ownership is passed to the callee. I personally dislike the expression ‘rvalue reference’ and I would have preferred something less technical.

void rvalue_reference(MyClass&& obj) {}

However, in some instances, you do not care whether your function gets to own the value or merely has a reference to it. Writing two functions duplicates the code. Instead, you can use a forwarding reference:

template <typename T>
void forwarding_reference(T&& obj) {}

It looks like an rvalue reference, but it is not: it can be either a normal reference or an rvalue reference depending on how you call it.

Here is how you might call these functions in practice:

MyClass obj;
value(obj);
reference(obj);
rvalue_reference(MyClass());
forwarding_reference(obj);
forwarding_reference(MyClass());

The following table is a summary. A forwarding reference might be either a regular reference or an rvalue reference depending on how it is called.

                         caller owns?  callee owns?
by value                 yes           yes
by reference (&)         yes           no
by rvalue reference (&&) no            yes
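
In generic code, a forwarding reference is typically paired with std::forward so that the argument keeps its value category when passed along. Here is a minimal sketch; the MyClass type is the same as above, while the two process overloads are hypothetical, added only to show which one gets selected:

#include <utility>

struct MyClass {};

void process(MyClass&) {}   // selected for lvalues
void process(MyClass&&) {}  // selected for rvalues

template <typename T>
void forwarding_reference(T&& obj) {
  // std::forward<T> preserves the value category of the argument:
  // an lvalue stays an lvalue, an rvalue stays an rvalue.
  process(std::forward<T>(obj));
}

int main() {
  MyClass obj;
  forwarding_reference(obj);        // calls process(MyClass&)
  forwarding_reference(MyClass());  // calls process(MyClass&&)
}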

Peer review is not the gold standard in science

Peer review as we know it today was introduced very late, over a century after the scientific revolution. It happened after Einstein’s time… arguably the most productive era in science. Current scientists often equate success with publication in a selective peer-reviewed venue. But that was never the scientific paradigm. In fact, it is pre-scientific thinking. Back in Einstein’s time, many scientists believed in the ether. It would have been difficult to dismiss the ether as a concept. The prudent approach would have been to pay lip service to the ether. Similarly, most scientists believed in eugenics. They believed in forced sterilization for the greater good. Many of the racist laws in the US followed straight from progressive science. Opposing eugenics would have been difficult in the context of peer review. It would have been difficult to challenge eugenics openly as a scientist.

Siler et al. (2014) looked at published manuscripts that were initially rejected. They found:

Of the 808 eventually published articles in our dataset, our three focal journals rejected many highly cited manuscripts, including the 14 most popular; roughly the top 2 percent. Of those 14 articles, 12 were desk-rejected. This finding raises concerns regarding whether peer review is ill-suited to recognize and gestate the most impactful ideas and research.

Recently, people like Matt Ridley challenged the idea that the SARS-CoV-2 virus originated from nature. Back when he published his book on the topic, such a challenge would have had difficulty passing peer review.

You may not remember, but early on, it was widely accepted that the lab origin of SARS-CoV-2 was a notion fit only for far-right conspiracy theorists. The Canadian state broadcaster (CBC) told us, in its ‘science’ section:

One of the most persistent and widespread pieces of disinformation during the COVID-19 pandemic has been the conspiracy theory that the novel coronavirus that causes the disease was created in a lab — and was let loose either by accident or on purpose by some nefarious actor.

In the US, Senator Tom Cotton suggested that the spread of the coronavirus was connected to research at the Wuhan Institute of Virology. In response, the Washington Post wrote:

Sen. Tom Cotton (R-Ark.) keeps repeating a coronavirus conspiracy theory that was already debunked. (…) In response to Cotton’s remarks, as well as in previous interviews with The Washington Post, numerous experts dismissed the possibility the coronavirus may be man-made.

Here is what one of the most reputable medical journals (The Lancet) published:

We stand together to strongly condemn conspiracy theories suggesting that COVID-19 does not have a natural origin.

The letter omitted the fact that its authors had glaring, undisclosed conflicts of interest.

Thacker describes some of these events in a piece for the BMJ:

But the effort to brand serious consideration of a lab leak a “conspiracy theory” only ramped up. Filippa Lentzos, codirector of the Centre for Science and Security Studies at King’s College, London, told the Wall Street Journal, “Some of the scientists in this area very quickly closed ranks.” She added, “There were people that did not talk about this, because they feared for their careers. They feared for their grants.”

Daszak had support. After he wrote an essay for the Guardian in June 2020 attacking the former head of MI6 for saying that the pandemic could have “started as an accident,” Jeremy Farrar, director of the Wellcome Trust [a major funder] and co-signer of the Lancet letter, promoted Daszak’s essay on Twitter, saying that Daszak was “always worth reading.”

Daszak’s behind-the-scenes role in orchestrating the statement in the Lancet came to light in November 2020 in emails obtained through freedom of information requests by the watchdog group US Right To Know.

“Please note that this statement will not have EcoHealth Alliance logo on it and will not be identifiable as coming from any one organization or person,” wrote Daszak in a February email, while sending around a draft of the statement for signatories. In another email, Daszak considered removing his name from the statement “so it has some distance from us and therefore doesn’t work in a counterproductive way.”

Several of the 27 scientists who signed the letter Daszak circulated did so using other professional affiliations and omitted reporting their ties to EcoHealth Alliance.

For Richard Ebright, professor of molecular biology at Rutgers University in New Jersey and a biosafety expert, scientific journals were complicit in helping to shout down any mention of a lab leak. “That means Nature, Science, and the Lancet,” he says. In recent months he and dozens of academics have signed several open letters rejecting conspiracy theory accusations and calling for an open investigation of the pandemic’s origins.

Efforts to characterise the lab leak scenario as unworthy of serious consideration were far reaching, sometimes affecting reporting that had first appeared well before the covid-19 pandemic. For example, in March 2020 Nature Medicine added an editor’s note (“Scientists believe that an animal is the most likely source of the coronavirus”) to a 2015 paper on the creation of a hybrid version of a SARS virus, co-written by Shi.

Here are the facts as we knew them back then… as anyone could know…

  • There was an outbreak caused by a bat sarbecovirus, in the one city in the world that had been collecting hundreds of bat sarbecoviruses and experimenting on them.
  • It happened one year after that lab proposed inserting the one feature that distinguishes SARS‑CoV‑2 from all other viruses.
  • The lab in question refuses to this day to release the database of the viruses it had been working on.
  • Virus leaks have been common.

It was always sensible to ask whether SARS-CoV-2 came from the Wuhan lab. Yet this was openly censored. As is often the case, instead of reflecting on this failure, many people rewrite history. “We never denied it could have come from a lab”, they say. “We never denied that it could have been human-made,” they say. But they very explicitly and strongly did so. They specifically and repeatedly said that this virus could not have been made in a laboratory:  yet a funding application to do exactly that, a few years before, had been submitted to the US government by Daszak, the very man who insisted that the lab origin was a conspiracy theory.

Of course, knowledgeable scientists knew that the lab origin was a possibility. They did not dare to speak up. Would you speak up when it could mean the end of your career?

This was not at all an isolated incident. Dr. Scott Atlas was censored by Stanford for questioning the covid dogma. The Stanford Review writes:

This censure, now a black mark on the University, was unquestionably motivated by political animosity. Atlas, a health policy expert who worked as a professor at Stanford Medical School for fourteen years, chose to serve his country by taking an advisory role in the Trump Administration’s White House Coronavirus Task Force. As an advisor, Atlas suggested reopening schools and businesses and pushed back against draconian lockdown policies.

You might answer… ‘Attacking people for getting closer to the truth isn’t new.’ But science seeks to address this very point. In fact, it is the very essence of the epistemology of science: the recognition that truth is not arrived at by social consensus or by following the powerful. There are many ways to describe science, but to a first approximation… Science is the process whereby anyone can post ideas and results for others to replicate, and everyone gets to fail in public and, hopefully, correct themselves. Science is the opposite of a gatekeeping process; it is, by its very nature, a progressive and open process.

This does not mean that you should avoid publishing in peer-reviewed venues. But you need to recognize that peer review is not the gold standard in science. Evidence is evidence. Consensus is not evidence.

Remember: ‘The reasonable man adapts himself to the world; the unreasonable man persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.’

How fast can you construct a small list of strings in C for Python?

Python is probably the most popular programming language in the world right now. Python is easy to extend using C code.

You may want to return from Python a small data structure. When crossing from C to Python, there is an overhead. Thus, if performance is a concern, you do not want to return lots of small values such as hundreds of individual small strings.

However, you may want to return a list of strings such as

['zero elephant', 'one elephant is having fun', 'two elephants are having fun',
'three elephants are meeting with the president', 'four elephants',
'five elephants in an hexagon', 'six elephants are playing the saxophone',
'seven elephants are visiting the school', 'eight elephants are at Church',
'nine elephants are having a party']

I am going to assume that these strings are in an array called ‘numbers’ in my C code. Of course, if the strings are known at compile time, I could just precompute the array and return it. However, I want to generate the content dynamically.
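
For reference, here is how such an array might be declared; this exact declaration is my assumption, and the snippets below also presume that Python.h has been included:

#include <Python.h>

// the ten example strings, stored in a static array
static const char *numbers[] = {
    "zero elephant", "one elephant is having fun",
    "two elephants are having fun",
    "three elephants are meeting with the president", "four elephants",
    "five elephants in an hexagon", "six elephants are playing the saxophone",
    "seven elephants are visiting the school", "eight elephants are at Church",
    "nine elephants are having a party"};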

A reasonable C function which will return such a list is as follows:

PyObject* all_strings(PyObject* self) {
  size_t len = sizeof(numbers) / sizeof(numbers[0]);
  PyObject* pyList = PyList_New(len);
  for (size_t i = 0; i < len; ++i) {
    PyObject* pyString = PyUnicode_FromString(numbers[i]);
    PyList_SET_ITEM(pyList, i, pyString);
  }
  return pyList;
}

It looks a bit complicated, but it is not. It is just an application of the Python C API. The key element is the PyUnicode_FromString function which takes a pointer to a UTF-8 string and converts it into a Python string. In our case, all our strings are ASCII, so the function does more work than needed. In this instance, I know ahead of time the size of the list (hence PyList_New(len)), but I could also have constructed an empty list (PyList_New(0)) and appended strings to it as needed.
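
For completeness, here is a sketch of that append-based variant (my own illustration, using the same numbers array, with error handling omitted); PyList_Append takes its own reference to the item, so we release ours after appending:

PyObject* all_strings_append(PyObject* self) {
  size_t len = sizeof(numbers) / sizeof(numbers[0]);
  PyObject* pyList = PyList_New(0); // start from an empty list
  for (size_t i = 0; i < len; ++i) {
    PyObject* pyString = PyUnicode_FromString(numbers[i]);
    PyList_Append(pyList, pyString); // the list increments the reference count
    Py_DECREF(pyString);             // release our own reference
  }
  return pyList;
}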

Can we do better? We might. Python introduced lower-level string construction functions. You start by constructing a string that cannot be resized, and you must tell it whether the string content fits in one byte (ASCII or Latin1), or whether it requires more storage per character (UTF-16 without surrogate pairs) or whether you need the full Unicode range. In my case, it is easy: my strings are pure ASCII. In the general case, you would need to ascertain the string type (a functionality we plan to add to the simdutf library).

The code becomes slightly more complicated, but barely so…

PyObject* all_strings_new(PyObject* self) {
  size_t len = sizeof(numbers) / sizeof(numbers[0]);
  PyObject* pyList = PyList_New(len);
  for (size_t i = 0; i < len; ++i) {
    const char * str = numbers[i];
    size_t str_len = strlen(str);
    PyObject* py_string = PyUnicode_New(str_len, 127);
    Py_UCS1* data = PyUnicode_1BYTE_DATA(py_string);
    memcpy(data, str, str_len);
    PyList_SET_ITEM(pyList, i, py_string);
  }
  return pyList;
}

I wrote a small benchmark. You need Python and a development environment to build it. I ran the benchmark on my Apple M2 laptop, using LLVM 15 and Python 3.12. Your results will vary.

In my case, I get a very significant boost (2x) from the lower-level approach.

basic 270 ns/structure
lower-level 140 ns/structure

My data structure contains 295 bytes of strings, so the lower-level approach generates strings at over 2 GB/s (295 bytes every 140 ns), whereas the conventional approach runs at about half that speed.

Should Node.js be built with ClangCL under Windows?

Under Windows, when using Visual Studio to build C++ code, there are two possible compiler strategies. The Visual Studio compiler (often referred to as MSVC) is the default compiler provided by Microsoft for Windows development. ClangCL is part of the Clang/LLVM project, an open-source compiler toolchain. ClangCL is compatible with the Visual Studio runtime and links with the Microsoft implementation of the Standard Library. It is available as an optional component in Visual Studio 2019 and later versions. In Debug mode, the regular Visual Studio compiler produces faster compilation times and more performant code than ClangCL.

In Debug mode, I find that the regular Visual Studio compiler builds faster. However, in Release mode, I find empirically that ClangCL may produce more performant code. On some micro-benchmarks, the difference can be large (e.g., 40%), although I expect more modest gains on complex systems.

As of Chrome 64, Google Chrome for Windows is compiled with ClangCL. Thus Clang is now used to build Chrome for all platforms it runs on, including macOS, iOS, Linux, Chrome OS, Android, and Windows. Firefox switched to ClangCL in 2018. And at least some game developers have adopted ClangCL.

Node.js is an open-source, cross-platform JavaScript runtime environment. It allows developers to execute JavaScript code outside of a web browser. Unlike traditional JavaScript, which primarily runs in browsers, Node.js enables server-side execution of JavaScript. Node.js is part of popular web development stacks, and it relies on the V8 JavaScript engine, the same engine used by Google Chrome.

Node.js is built under Windows using the regular Visual Studio compiler. Thanks in large part to Michaël Zasso, it is possible to build Node.js under Windows with ClangCL. Could it improve the performance?

To start answering this question, I ran the standard V8 benchmarks from Node.js. These benchmarks mostly focus on V8 performance and are not affected by changes in other components. For my tests, I use Visual Studio 2022 on a Microsoft Surface Laptop Studio.

All results point at improvements. That is, on average, the speed is greater with ClangCL than with the standard Visual Studio compiler. However, there is much noise in my numbers. Using the V8 benchmarks, only one test shows a statistically strong improvement (serialize.js len=256).

function improvement
v8\get-stats getHeapSpaceStatistics 3% +/- 11%
v8\get-stats getHeapStatistics 10% +/- 11%
v8\serialize.js len=256 6% +/- 2%
v8\serialize.js len=16384 2% +/- 2%
v8\serialize.js len=524288 19% +/- 50%

I should stress that compilers have strengths and weaknesses. The regular Visual Studio compiler is perfectly capable and you should expect it to do better in some instances. And Microsoft has some of the best compiler engineers in the world: each new version might bring improvements.

Furthermore, some tasks and benchmarks are not necessarily affected by the choice of a compiler: e.g., a network access, a disk processing routine, or a memory allocation stress test.

Yet it appears that there might be benefits to a transition to ClangCL for Node.js.

Careful with Pair-of-Registers instructions on Apple Silicon

Egor Bogatov is an engineer working on C# compiler technology at Microsoft. He had an intriguing remark about a performance regression on Apple hardware following what appears to be an optimization. The .NET 9.0 runtime introduced an optimization where two loads (ldr) could be combined into a single load-pair instruction (ldp). It is a typical peephole optimization. Yet it made things much slower in some cases.


Under ARM, the ldr instruction is used to load a single value from memory into a register. It operates on a single register at a time. Its assembly syntax is straightforward ldr Rd, [Rn, #offset]. The ldp instruction (Load Pair of Registers) loads two consecutive values from memory into two registers simultaneously. Its assembly syntax is similar but there are two destination registers: ldp Rd1, Rd2, [Rn, #offset]. The ldp instruction loads two 32-bit words or two 64-bit words from memory, and writes them to two registers.

Given a choice, it seems that you should prefer the ldp instruction. After all, it is a single instruction. But there is a catch on Apple silicon: if you are loading data from memory that was just written to, there might be a significant penalty to ldp.

To illustrate, let us consider the case where we write and load two values repeatedly using two loads and two stores:

for (int i = 0; i < 1000000000; i++) {
  int tmp1, tmp2;
  __asm__ volatile("ldr %w0, [%2]\n"
                   "ldr %w1, [%2, #4]\n"
                   "str %w0, [%2]\n"
                   "str %w1, [%2, #4]\n"
    : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr):);
}

Next, let us consider an optimized approach where we combine the two loads into a single one:

for (int i = 0; i < 1000000000; i++) {
  int tmp1, tmp2;
  __asm__ volatile("ldp %w0, %w1, [%2]\n"
                   "str %w0, [%2]\n"
                   "str %w1, [%2, #4]\n"
    : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) :);
}

It would be surprising if this new version was slower, but it can be. The code for the benchmark is available. I benchmarked both on AWS using Amazon’s graviton 3 processors, and on Apple M2. Your results will vary.

function           graviton 3   Apple M2
2 loads, 2 stores  2.2 ms/loop  0.68 ms/loop
1 load, 2 stores   1.6 ms/loop  1.6 ms/loop

I have no particular insight as to why it might be, but my guess is that Apple Silicon has a Store-to-Load forwarding optimization that does not work with Pair-Of-Registers loads and stores.

There is an Apple Silicon CPU Optimization Guide which might provide better insight.

Large language models (e.g., ChatGPT) as research assistants

Software can beat human beings at most games… from Chess to Go, and even poker. Large language models like GPT-4 offered through services such as ChatGPT allow us to solve a new breed of problems. GPT-4 can beat 90% of human beings at the bar exam. Artificial intelligence can match math Olympians.

The primary skills of academics are language-related: synthesis, analogy, extrapolation, etc. Academics analyze the literature, identify gaps, and formulate research questions. They review and synthesize existing research. They write research papers, grant proposals, and reports. Being able to produce well-structured and grammatically correct prose is a vital skill for academics.

Unsurprisingly, software and artificial intelligence can help academics, and maybe replace them in some cases. Liang et al. found that an increasing number of research papers are written with tools like GPT-4 (up to 18% in some fields). It is quite certain that in the near future, a majority of all research papers will be written with the help of artificial intelligence. I suspect that they will be reviewed with artificial intelligence as well. We might soon face a closed loop where software writes papers while other software reviews them.

I encourage scholars to apply artificial intelligence immediately for tasks such as…

  1. Querying a document. A tool like BingChat from Microsoft allows you to open a PDF document and query it. You may ask “what are the main findings of this study?” or “are there any practical applications for this work?”.
  2. Improve text. Many academics, like myself, use English as a second language. Of course, large language models can translate, but they can also improve your wording. It is more than a mere grammar checker: it can rewrite part of your text, correcting bad usages as it goes.
  3. Idea generation. I used to spend a lot of time chatting with colleagues about a vague idea I had. “How could we check whether X is true?” A tool like ChatGPT can help you get started. If you ask how to design an experiment to check a given hypothesis, it can often do a surprisingly good job.
  4. Grant applications. You can use tools like ChatGPT to help you with grant applications. Ask it to make up short-term and long-term objectives, sketch a methodology and discuss the impact of your work… it will come up with something credible right away. It is likely that thousands of grant applications have been written with artificial intelligence.
  5. Writing code. You are not much of a programmer, but you want an R script that will load data from your Excel spreadsheet and do some statistical analysis? ChatGPT will do it for you.
  6. Find reviewers and journals. Sometimes you have done some work and you would like help picking the right journal; a tool like ChatGPT can help. If a student of yours finished their thesis, ChatGPT can help you identify prospective referees.

I suspect that much academic work will soon greatly benefit from artificial intelligence to the point where a few academics will be able to do the work that required an entire research institute in the past.

And this new technology should make mediocre academics even less useful, relatively speaking. If artificial intelligence can write credible papers and grant applications, what is the worth of someone who can barely do these things?

You would think that these technological advances should accelerate progress. But, as argued by Patrick Collison and Michael Nielsen, science productivity has been falling despite all our technological progress. Physics is not advancing faster today than it did in the first half of the XXth century. It may even be stagnant in relative terms. I do not think that we should hastily conclude that ChatGPT will somehow accelerate the rate of progress in Physics. As Clusmann et al. point out, it may simply ease scientific misconduct. We could soon be drowning in a sea of automatically generated documents. Messeri and Crockett put it elegantly:

AI tools in science risks introducing a phase of scientific enquiry in which we produce more but understand less

Yet there are reasons to be optimistic. By allowing a small group of researchers to be highly productive, by freeing them to explore further with less funding, we could be on the verge of entering into a new era of scientific progress. However, it may not be directly measurable using our conventional tools. It may not appear as more highly cited papers or through large grants. A good illustration is Hugging Face, a site where thousands of engineers from all over the world explore new artificial-intelligence models. This type of work is undeniably scientific research: we have metrics, hypotheses, testing, reproducibility, etc. However, it does not look like ‘academic work’.

In any case, conventional academics will be increasingly challenged. Ironically, plumbers and electricians won’t be so easily replaced, a fact sometimes attributed to the Moravec paradox. Steven Pinker wrote in 1994 that cooks and gardeners are secure in their jobs for decades to come, unlike stock market analysts and engineers. But I suspect that the principle even extends within the academy: some work, like conducting actual experiments, is harder to automate than producing and running models. Theoretical work is likely more impacted by artificial intelligence than more applied, concrete work.

Note: This blog post was not written with artificial intelligence. Expect typos and grammatical mistakes.

How do you recognize an expert?

Go back to the roots: experience. An expert is someone who has repeatedly solved the concrete problem you are encountering. If your toilet leaks, an experienced plumber is an expert. An expert has a track record and has had to face the consequences of their work. Failing is part of what makes an expert: any expert should have stories about how things went wrong.

I associate the word expert with ‘the problem’ because we know that expertise does not transfer well: a plumber does not necessarily make a good electrician. And within plumbing, there are problems that only some plumbers should solve. Furthermore, you cannot abstract a problem: you can study fluid mechanics all you want, but it won’t turn you into an expert plumber.

That’s one reason why employers ask for relevant experience: they seek expertise they can rely on. It is sometimes difficult to acquire expertise in an academic or bureaucratic setting because the problems are distant or abstract. Your experience may not translate well into practice. Sadly we live in a society where we often lose track of and undervalue genuine expertise… thus you may take software programming classes from people who never built software or civil engineering classes from people who never worked on infrastructure projects.

So… how do you become an expert? Work on real problems. Do not fall for reverse causation: if all experts dress in white, dressing in white won’t turn you into an expert. Listening to the expert is not going to turn you into an expert. Lectures and videos can be inspiring but they don’t build your expertise. Getting a job with a company that has real problems, or running your own business… that’s how you acquire experience and expertise.

Why would you want to acquire expertise when you can make a good living otherwise, without the hard work of solving real problems? Actual expertise is capital that can survive a market crash or a political crisis. After Germany’s defeat in 1945… many of the aerospace experts went to work for the American government. Relevant expertise is robust capital.

Why doesn’t everyone seek genuine expertise? Because there is a strong countervailing force: showing a total lack of practical skill is a status signal. Wearing a tie shows that you don’t need to work with your hands.

But again: don’t fall for reverse causality… broadcasting that you don’t have useful skills might be fun if you are already of high status… but if not, it may not grant you a higher status.

And status games without a solid foundation might lead to anxiety. If you can get stuff done, if you can fix problems, you don’t need to worry so much about what people say about you. You may not like the color of your plumber’s shoes, but you won’t snub him over it.

So get expertise and maintain it. You are likely to become more confident and happier.

How quickly can you break a long string into lines?

Suppose that you receive a long string and you need to break it down into lines. Consider the simplified problem where you need to break the string into segments of (say) 72 characters. It is a relevant problem if your string is a base64 string or a Fortran formatted statement.

The real problem can be a bit more complicated because you might need to consider the syntax. So the speed of breaking into a new line every 72 characters, irrespective of the content, provides an upper bound on the performance of breaking content into lines.

The most obvious algorithm could be to copy the content, line by line:

void break_lines(char *out, const char *in, size_t length,
  size_t line_length) {
  size_t j = 0;
  size_t i = 0;
  for (; i + line_length <= length; i += line_length) {
    memcpy(out + j, in + i, line_length);
    out[j+line_length] = '\n';
    j += line_length + 1;
  }
  if (i < length) {
    memcpy(out + j, in + i, length - i);
  }
}

Copying data in blocks is usually quite fast unless you are unlucky and you trigger aliasing. However, allocating a whole new buffer could be wasteful, especially if you only need to extend the current buffer by a few bytes.

A better option could thus be to do the work in-place. The difficulty is that if you load the data from the current array, and then write it a bit further away, you might be overwriting the data you need to load next. A solution is to proceed in reverse: start from the end… move what would be the last line off by a few bytes, then move the second last line and so forth. Your code might look like the following C function:

void break_lines_inplace(char *in, size_t length, size_t line_length) {
  size_t left = length % line_length;
  size_t i = length - left;
  size_t j = length + length / line_length - left;
  memmove(in + j, in + i, left);
  while (i >= line_length) {
    i -= line_length;
    j -= line_length + 1;
    memmove(in + j, in + i, line_length);
    in[j+line_length] = '\n';
  }
}

I wrote a benchmark. I report the results only for a 64KB input. Importantly, my numbers do not include memory allocation which is separate.

A potentially important factor is whether we allow function inlining: without inlining, the compiler does not know the line length at compile-time and cannot optimize accordingly.

Your results will vary, but here are my own results:

method                Intel Ice Lake, GCC 12  Apple M2, LLVM 14
memcpy                43 GB/s                 70 GB/s
copy                  25 GB/s                 40 GB/s
copy (no inline)      25 GB/s                 37 GB/s
in-place              25 GB/s                 38 GB/s
in-place (no inline)  25 GB/s                 38 GB/s

In my case, it does not matter whether we do the computation in-place or not. The in-place approach generates more instructions but we are not limited by the number of instructions.

At least in my results, I do not see a large effect from inlining. In fact, for the in-place routine, there appears to be no effect whatsoever.

Roughly speaking, I achieve a bit more than half the speed of a memory copy. We might be limited by the number of loads and stores. There might be a clever way to close the gap.

Science and Technology links (April 13 2024)

      1. Our computer hardware exchanges data using a standard called PCI Express. Your disk, your network and your GPU are limited by what PCI Express can do. Currently, it means that you are limited to a few gigabytes per second of bandwidth. PCI Express is about to receive a big performance boost in 2025 when PCI Express 7 comes out: it will support bandwidth of up to 512 GB/s. That is really, really fast. It does not follow that your disks and graphics are going to improve very soon, but it provides the foundation for future breakthroughs.
      2. Sperm counts are down everywhere and the trend is showing no sign of slowing down. There are indications that it could be related to the rise in obesity.
      3. A research paper by Burke et al. used a model to predict that climate change could reduce world GDP (the size of the economy) by 23%. For reference, world GDP grows at a rate of about 3% a year (+/- 1%) so that a cost of 23% is about equivalent to 7 to 8 years without growth. It is much higher than prior predictions. Barket (2024) questions these results:

        It is a complicated paper that makes strong claims. The authors use thousands of lines of code to run regressions containing over 500 variables to test a nonlinear model of temperature and growth for 166 countries and forecast economic growth out to the year 2100. Careful analysis of their work shows that they bury inconvenient results, use misleading charts to confuse readers, and fail to report obvious robustness checks. Simulations suggest that the statistical significance of their results is inflated. Continued economic growth at levels similar to what the world has experienced in recent years would increase the level of future economic activity by far more than Nordhaus’ (2018) estimate of the effect of warming on future world GDP. If warming does not affect the rate of economic growth, then the world is likely to be much richer in the future, with or without warming temperatures.

      4. The firm McKinsey reports finding statistically significant positive relations between firms’ industry-adjusted earnings and the racial/ethnic diversity of their executives. Green and Hand (2024) fail to reproduce these results. They conclude: despite the imprimatur given to McKinsey’s studies, their results should not be relied on to support the view that US publicly traded firms can expect to deliver improved financial performance if they increase the racial/ethnic diversity of their executives.
      5. Corinth and Larrimore (2024) find that after adjusting for hours worked, Generation X and Millennials experienced a greater intergenerational increase in real market income than baby boomers.

Greatest common divisor, the extended Euclidean algorithm, and speed!

We sometimes need to find the greatest common divisor between two integers in software. The fastest way to compute the greatest common divisor might be the binary Euclidean algorithm. In C++20, it can be implemented generically as follows:

template <typename int_type>
int_type binary_gcd(int_type u, int_type v) {
  if (u == 0) { return v; }
  if (v == 0) { return u; }
  auto shift = std::countr_zero(u | v);
  u >>= std::countr_zero(u);
  do {
   v >>= std::countr_zero(v);
   if (u > v) { std::swap(u, v); }
   v = v - u;
  } while (v != 0);
  return u << shift;
}

The std::countr_zero function computes the “number of trailing zeroes” in an integer. A key insight is that this function often translates into a single instruction on modern hardware.
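
As a tiny illustration of what std::countr_zero returns (it lives in the <bit> header and requires an unsigned integer type):

#include <bit>
#include <cassert>

int main() {
  // 12 is 0b1100: it has two trailing zero bits
  assert(std::countr_zero(12u) == 2);
  // a power of two such as 1 << 5 has exactly five trailing zeros
  assert(std::countr_zero(1u << 5) == 5);
}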

The computational complexity of the binary Euclidean algorithm is linear in the number of bits of the larger of the two integers.

There are many variations that might be more efficient. I like an approach proposed by Paolo Bonzini which is simpler as it avoids the swap:

template <typename int_type>
int_type binary_gcd_noswap(int_type u, int_type v) {
  if (u == 0) { return v; }
  if (v == 0) { return u; }
  auto shift = std::countr_zero(u | v);
  u >>= std::countr_zero(u);
  do {
   int_type t = v >> std::countr_zero(v);
   if (u > t) v = u - t, u = t;
   else v = t - u;
  } while (v != 0);
  return u << shift;
}

The binary Euclidean algorithm is typically faster than the textbook Euclidean algorithm which has to do divisions (a slow operation), although the resulting code is pleasantly short:

template <typename int_type>
int_type naive_gcd(int_type u, int_type v) {
  return (u % v) == 0 ? v : naive_gcd(v, u % v);
}

There are cases where the naive GCD algorithm is faster. For example, if v divides u, which is always the case when v is 1, then the naive algorithm returns immediately whereas the binary GCD algorithm might require many steps if u is large.

To balance the result, we can use a hybrid approach where we first use a division, as in the conventional Euclidean algorithm, and then switch to the binary approach:

template <class int_type> 
int_type hybrid_binary_gcd(int_type u, int_type v) {
  if (u < v) { std::swap(u, v); }
  if (v == 0) { return u; }
  u %= v;
  if (u == 0) { return v; }
  auto zu = std::countr_zero(u);
  auto zv = std::countr_zero(v);
  auto shift = std::min(zu, zv);
  u >>= zu;
  v >>= zv;
  do {
    int_type u_minus_v = u - v;
    if (u > v) { u = v, v = u_minus_v; }
    else {v = v - u; }
    v >>= std::countr_zero(u_minus_v);
  } while (v != 0);
  return u << shift;
}

I find it interesting that there is now a std::gcd function in the C++ standard library, so you may not want to implement your own greatest-common-divisor function if you are programming in modern C++.
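
As a quick illustration, std::gcd lives in the <numeric> header and has been available since C++17:

#include <cstdio>
#include <numeric>

int main() {
  // the greatest common divisor of 462 and 1071 is 21
  printf("%d\n", std::gcd(462, 1071));
}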

For the mathematically inclined, there is also an extended Euclidean algorithm. It computes not only the greatest common divisor, but also the Bézout coefficients. That is, given two integers a and b, it finds integers x and y such that x * a + y * b = gcd(a,b). I must admit that I never had any need for the extended Euclidean algorithm. Wikipedia says that it is useful to find multiplicative inverses in modular arithmetic, but the only multiplicative inverses I ever needed were computed with a fast Newton algorithm. Nevertheless, we might implement it as follows:

template <typename int_type> struct bezout {
  int_type gcd;
  int_type x;
  int_type y;
};

// computes the greatest common divisor between a and b,
// as well as the Bézout coefficients x and y such as
// a*x + b*y = gcd(a,b)
template <typename int_type>
bezout<int_type> extended_gcd(int_type u, int_type v) {
  std::pair<int_type, int_type> r = {u, v};
  std::pair<int_type, int_type> s = {1, 0};
  std::pair<int_type, int_type> t = {0, 1};
  while (r.second != 0) {
    auto quotient = r.first / r.second;
    r = {r.second, r.first - quotient * r.second};
    s = {s.second, s.first - quotient * s.second};
    t = {t.second, t.first - quotient * t.second};
  }
  return {r.first, s.first, t.first};
}

There is also a binary version of the extended Euclidean algorithm, although it is quite a bit more involved and it is not clear that it can be implemented at high speed, leveraging fast instructions, when working on integers that fit in general-purpose registers. It may be beneficial when working with big integers. I am not going to reproduce my implementation, but it is available in my software repository.

To compare these functions, I decided to benchmark them over random 64-bit integers. I find it interesting that the majority of pairs of random integers (about two thirds) were coprime, meaning that their greatest common divisor is 1. Mathematically, we would expect the ratio to be 6/pi^2 ≈ 0.61, which is about right empirically. At least some pairs had non-trivial greatest common divisors (e.g., 42954).
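
As a sanity check, here is a small sketch of my own (using std::gcd rather than the functions above) that estimates the fraction of coprime pairs among random 64-bit integers:

#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>

int main() {
  std::mt19937_64 rng(1234); // fixed seed for reproducibility
  const size_t trials = 1000000;
  size_t coprime = 0;
  for (size_t i = 0; i < trials; i++) {
    uint64_t a = rng();
    uint64_t b = rng();
    if (std::gcd(a, b) == 1) { coprime++; }
  }
  // we expect roughly 6/pi^2, which is about 0.61
  printf("coprime fraction: %f\n", double(coprime) / double(trials));
}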

Computing the greatest common divisor takes hundreds of instructions and hundreds of CPU cycles. If you somehow need to do it often, it could be a bottleneck.

I find that the std::gcd implementation which is part of the GCC C++ library under Linux is about as fast as the binary Euclidean function I presented. I have not looked at the implementation, but I assume that it is well designed. The version in the C++ library on macOS (libc++) appears to be the naive implementation. Thus there is an opportunity to improve the libc++ implementation.

The extended Euclidean-algorithm implementation runs at about the same speed as a naive regular Euclidean-algorithm implementation, which is what you would expect. My implementation of the binary extended Euclidean algorithm is quite a bit slower and not recommended. I expect that it should be possible to optimize it further.

function          GCC 12 + Intel Ice Lake  Apple LLVM + M2
std::gcd          7.2 million/s            7.8 million/s
binary            7.7 million/s            12 million/s
binary (no swap)  9.2 million/s            14 million/s
hybrid binary     12 million/s             17 million/s
extended          2.9 million/s            7.8 million/s
binary ext.       0.7 million/s            2.9 million/s

It may seem surprising that the extended Euclidean algorithm runs at the same speed as std::gcd on some systems, despite the fact that it appears to do more work. However, the computation of the Bézout coefficients along with the greatest common divisor is not on the critical path, and can be folded in with the rest of the computation on a superscalar processor… so the result is expected.

My source code is available.

As part of the preparation of this blog post, I had initially tried writing a C++ module. It worked quite well on my MacBook. However, it fell apart under Linux with GCC, so I reverted the change. I was quite happy with how using modules made the code simpler, but it is not yet sufficiently portable.

Credit: Thanks to Harold Aptroot for a remark about the probability of two random integers being coprime.

Further reading: The libc++ library might update its std::gcd implementation.

A simple algorithm to compute the square root of an integer, byte by byte

A reader asked me for some help in computing (1 – sqrt(0.5)) to an arbitrary precision, from scratch. A simpler but equivalent problem is to compute the square root of an integer (e.g., 2). There are many sophisticated algorithms for such problems, but we want something relatively simple. We’d like to compute the square root bit by bit…

For example, the square root of two is…

  1. 5 / 4
  2. 11 / 8
  3. 22 / 16
  4. 45 / 32
  5. 90 / 64
  6. 181 / 128

More practically, 8-bit by 8-bit, we may want to compute it byte by byte…

  1. 362 / 256
  2. 92681 / 65536
  3. 23726566 / 16777216

How can we do so?

Intuitively, you could compute the integer part of the answer by starting with 0 and incrementing a counter like so:

x1 = 0
while (x1+1)**2 <= M:
  x1 += 1

Indeed, the square of the integer part cannot be larger than M.

You can repeat the same idea with the fractional part… writing the answer as x1+x2/B+... smaller terms.

x2 = 0
while (x1*B + x2 + 1)**2 <= M*B**2:
  x2 += 1

It will work, but it involves squaring ever larger numbers. That is inefficient.

We don’t actually need to compute powers when iterating. If you need to compute x**2, (x+1)**2, (x+2)**2, and so forth, you can instead use a recursion: if you have computed (x+n)**2 and you need the next power, you just add 2*(x+n) + 1, because that is the value of (x+n+1)**2 - (x+n)**2.

Finally, we get the following routine (written in Python). I left the asserts in place to make the code easier to understand:

B = 2**8 # or any other basis like 2 or 10
x = 0
power = 0
limit = M
for i in range(10): # 10 is the number of digits you want
  limit *= B**2
  power *= B**2
  x*=B
  while power + 2*x + 1 <= limit:
    power += 2*x + 1
    x += 1
    assert(x**2 == power)
    assert(x**2 <= limit)
# x/B**10 is the desired root 

You can simplify the code further by turning the power variable into a local variable within the loop: at the end of each iteration, we subtract it from the limit variable.

B = 2**8
x = 0
limit = M
for i in range(10):
  limit *= B**2
  power = 0
  x*=B
  while power + 2*x + 1 <= limit:
    power += 2*x + 1
    x += 1
  limit -= power
# x/B**10 is the desired root 

The algorithm could be further optimized if you needed more efficiency. Importantly, it is assumed that the basis is not too large; otherwise, another type of algorithm would be preferable. Using 256 is fine, however.

Obviously, one can design a faster algorithm, but this one has the advantage of being nearly trivial.

Further reading: A Spigot-Algorithm for Square-Roots: Explained and Extended by Mayer Goldberg

Credit: Thanks to David Smith for inspiring this blog post.

C++ web app with Crow: early scalability results

Last year, I looked at writing small “hello world” web applications in various programming languages (Go, JavaScript, Nim…). Go, using nothing but the standard library, did well.

In these benchmarks, I am just programming an HTTP route that returns a small string (e.g., ‘hello world’). The query is from the host itself. The intent behind such a benchmark is to measure how well a web application might scale in the best of cases. I call such a benchmark ‘simplistic’ because nobody ever returns just a short string and you do not usually query the server from the host.

At the time, I had wanted to compare with a C++ library, and I ended up trying the lithium framework which scaled very well.

Jake Arkinstall pointed out that he uses Crow to build web applications in C++. So I decided to take Crow out for a spin.

My simplistic application has only a few lines:

#include "crow.h"
int main() {
  crow::SimpleApp app;
  app.loglevel(crow::LogLevel::Warning);
  CROW_ROUTE(app, "/simple")([](){
    return "Hello world";
  });
  app.port(18080).multithreaded().run();
}

This allows the server to use all available threads. You can limit the server to fewer threads, say 32, by replacing multithreaded() with concurrency(32).
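
Concretely, based on the description above, the thread-capped variant only changes the last line of the program:

  // cap the server at 32 worker threads instead of using all of them
  app.port(18080).concurrency(32).run();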

To build it, I use a standard CMakeLists.txt file:

cmake_minimum_required(VERSION 3.15)
project(funserver CXX)
find_package(Crow REQUIRED)
add_executable(server src/server.cpp)
target_link_libraries(server Crow::Crow)

And I use a conan file to specify the dependency:

[requires]
crowcpp-crow/1.1.0
[generators]
CMakeDeps
CMakeToolchain

That is all. Then, to build, I issue the following commands in a shell:

conan profile detect --force
conan install . --output-folder=build --build=missing
cmake -B build -DCMAKE_TOOLCHAIN_FILE=conan_toolchain.cmake -DCMAKE_BUILD_TYPE=Release
cmake --build build
./build/server

I assume that you have a C++ compiler, Conan, and CMake, but these are standard tools.

After issuing these commands, my server is running. I use bombardier to hammer the server with requests. On a Linux server with many processors (two Intel Xeon Gold 6338 CPUs, each made of 32 cores) and much memory, I try increasing the number of simultaneous requests and looking for errors. As the number of simultaneous queries increases, the system has to sustain both a high number of requests and the processing done by the server. You can run my benchmark from my source code and instructions. Your numbers will differ.

simultaneous queries requests/s errors (%)
10 260k 0%
100 315k 0%
1000 380k 0%
10,000 350k 0.002%

I filed an issue with the Crow project regarding the errors. They are very uncommon and only occur under intense stress. They may or may not be the result of a bug.

My performance numbers are comparable to those of a lithium server. Let us rerun the same tests with lithium using 64 threads to verify:

simultaneous queries requests/s errors (%)
10 90k 0%
100 245k 0%
1000 275k 0%
10,000 240k 0%

Though lithium does not cause any errors even at high query counts, it has trouble shutting down after being stressed with 10,000 simultaneous queries.

So it seems that Crow offers state-of-the-art performance. Crow offers HTTP/1.1 and WebSocket support, but it has currently no support for the more recent standards. It has a nice Web site.


Science and Technology links (March 31 2024)

      1. Large language models (e.g., ChatGPT) do better at legal questions than lawyers: Our empirical analysis benchmarks LLMs against a ground truth set by Senior Lawyers, uncovering that advanced models match or exceed human accuracy in determining legal issues (Martin et al.).
      2. Gene therapy-mediated partial reprogramming extends lifespan and reverses age-related changes in aged mice.
      3. Increased vegetation greenness is called greening. Increased atmospheric CO2 is expected to lead to greening since CO2 is effectively a fertilizer. Global greening is a robust process that continued from 2001 to 2020 according to Chen et al. This confirms earlier studies which show significant greening of the Earth (up 30% in a few decades).
      4. Babylonia was one of the earliest civilizations (established around 1900 BC). The Old Kingdom in Egypt began earlier, around 2500 BC. What was happening elsewhere? We do not really know, but, in America, Caral–Supe was a complex pre-Columbian era society that included as many as thirty major population centers in modern-day Peru. The civilization flourished between the fourth and second millennia BC, with the formation of the first city generally dated to around 3500 BC.
      5. University students have average intelligence according to Uttl et al. They suggest the following consequences: First, universities and professors need to realize that students are no longer extraordinary but merely average, and have to adjust curricula and academic standards. Second, employers can no longer rely on applicants with university degrees to be more capable or smarter than those without degrees. Third, students need to realize that acceptance into university is no longer an invitation to join an elite group.
      6. Multicellular organisms arose 1.6 billion years ago.
      7. Some stone tools are 1.4 million years old.
      8. Being overweight (abdominal adiposity) is associated with mental decline.
      9. The decimal point might have been invented by mathematician Giovanni Bianchini in the 1440s.
      10. Climate models have been found lacking in a study by Simpson et al. (2023). From the article…

        Water vapor in the atmosphere is expected to rise with warming because a warmer atmosphere can hold more moisture. However, over the last four decades, near-surface water vapor has not increased over arid and semi-arid regions. This may indicate a major model misrepresentation of hydroclimate-related processes.

        Statement by the National Center for Atmospheric Research (NCAR):

        The discrepancy between observations and models indicates a major gap in our understanding and modeling capabilities which could have severe implications for hydroclimate projections, including fire hazard, moving forward. Resolving this discrepancy is an urgent priority.

      11. Good looks are inherited from the parents and associated with substantially higher incomes.
      12. Researchers have rejuvenated the immune systems of mice.
      13. Environmental activism is associated with the dark triad traits (i.e., Machiavellianism, psychopathy, narcissism) and authoritarianism (e.g., top-down censorship).
      14. Solar power plants increase local temperatures.
      15. In countries where there is more equality between women and men, women are less likely to pursue careers in science or engineering.
      16. Among our ancestors, women did hunt occasionally. Nevertheless, there was a clear division of labor.
      17. Psychopaths may have poor planning and be unable to foresee and represent future consequences of their actions.
      18. Northern lights in Northern countries significantly warm the atmosphere, reducing the heating bills in Finland.
      19. Urban environments are where biodiversity is thriving.
      20. Oral contraceptives reduce clitoral volume, worsen pain during intercourse, and reduce the frequency of orgasm.
      21. Higher healthcare spending is not associated with greater longevity, but higher wealth is: a $10,000 increase in GDP per capita in a state is associated with 1.13 years more life expectancy.
      22. Asking people to sign an agreement which says they won’t cheat does not reduce cheating.
      23. The term ‘two-spirit’ used to describe a gender specific to indigenous North Americans was coined in the 1990s. Wikipedia warns: Two-spirit, as a term and concept, is neither used nor accepted universally in Native American cultures.
      24. For a long time, diabetes was described as a progressive disease. Once you had diabetes, it was viewed as an incurable condition that could only be mitigated. At least in Canada, diabetes is now considered reversible: lifestyle choices (among other things) can reverse diabetes.
      25. Some people have a mutated protein RIMS1 and they exhibit much greater intelligence.
      26. One billion people in the world are obese. Intellectuals in the 1950s and 1960s predicted mass starvation. Most famously, Paul R. Ehrlich predicted in his book “The Population Bomb” (1968) that we would see widespread famine and mass starvation. He predicted that 65 million Americans would die of starvation between 1980 and 1989. He later stated that “India couldn’t possibly feed two hundred million more people by 1980.” Today our greatest problem is not mass starvation in India, but rather the fact that India’s obesity curves are exponential.
      27. Brain volume has increased during the XXth century in human beings.
      28. Overall, the Antarctic ice shelf area has grown by 5305 km² since 2009, or nearly 100 times the area of Manhattan Island.
      29. Obesity seems to affect semen quality.
      30. You want to make sure that your ratio of Omega 6 to Omega 3 is low. Think about eating salmon, sardines, and walnuts.
      31. We are often told that modern media is pushing women to dislike their bodies. However, that effect might be exaggerated. There could be deeply rooted evolutionary reasons for women to be anxious about the appearance of their bodies.
      32. Instead of being rewarded, loyal employees are targeted by managers for exploitative practices.
      33. Overbearing parents may cause their children to become emotionally unstable. Overreactive parenting is related to decreases in child agreeableness and emotional stability. There might be wisdom in cultivating freedom.
      34. Oreo cookies lower LDL cholesterol more than high-intensity statin therapy.
      35. We found water on the surface of an asteroid.
      36. It is believed that low-dose aspirin has an anti-cancer effect. It might be due to its suppression of some platelets.

Fast and concise probabilistic filters in Python

Sometimes you need to filter out or filter in data quickly. Suppose that your employer maintains a list of forbidden passwords or URLs or words. You may store them in a relational database and query them as needed. Unfortunately, this process can be slow and inefficient.

A better approach might be to use a probabilistic filter. A probabilistic filter is a sort of ‘approximate set’. You can ask it whether a key is present in the set, and if it is present, then you will always get ‘true’ (the correct answer). However, when the key is not present, you may still get ‘true’, although with a low probability. So the probabilistic filter is sometimes wrong. Why would you accept a data structure that is sometimes wrong? Because it can be several times smaller and faster than querying directly the actual set.

The best known probabilistic filter is the Bloom filter, but there are many others. For example, we recently presented the binary fuse filters which are smaller and faster than Bloom filters, for large and immutable sets.

The pyxorfilter module in Python is an implementation of the binary fuse filters. It provides several filter types, but both Fuse8 and Fuse16 are interesting if you have fairly large sets. They provide respectively a 0.39% and a 0.0015% false-positive rate. So a Fuse16 filter is nearly always correct. Why would you prefer Fuse8? Because it uses half the memory.

We can construct a probabilistic filter in Python with pyxorfilter like so:

import uuid
from pyxorfilter import Fuse8

data = [uuid.uuid4() for i in range(2000000)]

filter = Fuse8(len(data))
filter.populate(data)

Once it is done, you can be certain that all the content of your initial set is in the filter:

for d in data:
  assert filter.contains(d)

You can save the filter on disk and check how much memory is used…

import os

filename = 'filter.bin'  # any file name will do
f = open(filename, 'wb')
f.write(filter.serialize())
f.close()
print(os.path.getsize(filename)*8.0/len(data))  # bits per entry

If your set is large enough (say 1,000,000 elements), you will find that the memory usage is about 9 bits per entry. It grows a bit larger per entry as the set gets smaller. For smaller sets, the pyxorfilter module offers an alternative (Xor8) that can be slightly more efficient in these cases.

How do you know if you can trust the filter? Just query random inputs (highly likely not to be present) and see how many falsely appear to be in the set:

# estimate the false-positive rate
N = 1000000
count = 0
for i in range(N):
    count += filter.contains(uuid.uuid4())
fpp = count / N * 100.0  # false-positive rate, in percent

As I already implied, if you replace Fuse8 by Fuse16, then the memory usage per element goes up to about 18 bits, but the false positive rate is far lower: 0.00200%.

I produced a small benchmark. On my laptop, I get over 1 million queries per second (each query checking the presence of a string). On an Intel-based server, I get a lower number, about half a million queries per second.

For binary fuse filters, it does not matter whether the element is in the set or not as far as performance goes, so I use random inputs. When using a Bloom filter (say), you would typically get worse performance when the elements are in the set.

The pyxorfilter module was created by Amey Narkhede. It is still early days. I expect you can install pyxorfilter the usual way (pip install pyxorfilter) under x64 Linux. Unfortunately, on Windows and other platforms, there are issues getting the module installed (see issue 10) since binary modules are not available for all platforms. The pip installer should then try to install it from the source code. If you are a Python hacker, you can build it from source relatively easily:

git clone --recurse-submodules https://github.com/glitzflitz/pyxorfilter
cd pyxorfilter
python setup.py build_ext
python setup.py install

I am sure Amey would appreciate it if experienced Python hackers could help resolve the few remaining issues.

There are functionalities that pyxorfilter still lacks. Currently, you can save the filter to disk and recover it later; sadly, to use it, you must load the whole filter into memory. That should not be necessary: it might be more suitable to use memory file mapping or even other lower-level input-output operations.

What if you do not care about Python? You can use the xor and binary fuse filters in Go, C or C++, Zig, Rust, etc. I just love working in Python when I can.

Passing recursive C++ lambdas as function pointers

In modern C++, as in many popular languages, you can create ‘lambdas’. Effectively, they are potentially anonymous function instances that you can create on the fly as you are programming, possibly inside another function. The following is a simple example.

auto return1 = [](int n) -> int { return 1; };

What about recursive functions? At first I thought you could do…

auto fact = [](int n) -> int {
  if (n == 0) {
    return 1;
  } else {
    return n * fact(n - 1);
  }
};

Sadly, it fails to compile. What seems to be happening is that while the compiler recognizes the variable ‘fact’ within the definition of ‘fact’, it cannot use it without knowing its type. So you should specify the type of ‘fact’ right away. The following will work, as long as ‘fact’ is declared at global (namespace) scope so that the non-capturing lambda can refer to it:

std::function<int(int)> fact = [](int n) -> int {
  if (n == 0) {
    return 1;
  } else {
    return n * fact(n - 1);
  }
};

But using std::function templates may add complexity. For example, what if you have a function that takes a function as a parameter without using std::function, such as…

void print(int(*f)(int)) {
  for(int k = 1; k < 10; ++k) {
   std::cout << "Factorial of " << k << " is " << f(k) << std::endl;
  }
}

Then you would want to call print(fact), but it will not work directly. It may complain like so:

No known conversion from 'std::function' to 'int (*)(int)'

So let us avoid the std::function as much as possible:

int (*factorial)(int) = [](int n) -> int {
  if (n == 0) {
    return 1;
  } else {
    return n * factorial(n - 1);
  }
};

And then everything works out fine:

    print(factorial); // OK

Let me finish with a word of caution: functional programming is sophisticated, but it has downsides. One potential downside is performance. Let us consider this conventional code:

int factorialc(int n) {
  if (n == 0) {
    return 1;
  } else {
    return n * factorialc(n - 1);
  }
}
int functionc() {
  return factorialc(10);
}

Most compilers should produce highly optimized code in such a scenario. In fact, it is likely that the returned value of ‘functionc’ gets computed at compile time. The alternative using lambdas might look as follows:

int (*lfactorial)(int) = [](int n) -> int {
  if (n == 0) {
    return 1;
  } else {
    return n * lfactorial(n - 1);
  }
};

int functionl() {
  return lfactorial(10);
}

Though the results will depend on your system, I would expect far less efficient code in general.

Thus when programming in C++, if you use lambdas in performance-critical code, run benchmarks or disassemble your function to make sure that you have, indeed, a zero-cost abstraction.

My source code is available.

Credit: Thanks to Ca Yi, Yagiz Nizipli and many X users for informing this post.

Further reading: Recursive lambdas from C++14 to C++23

Measuring your system’s performance using software (Go edition)

When programming software, we work with an abstraction over a system. The computer hardware may not know about your functions, your variables, and your data. It may only see bits and instructions. Yet to write efficient software, the programmer needs to be aware of the characteristics of the underlying system. Thankfully, we can also use the software itself to observe the behavior of the system through experiments.

Between the software and the hardware, there are several layers, such as the compiler and the operating system. A good programmer should take these layers into account when needed. A good programmer must also understand the behavior of their software in terms of these layers.

Benchmarks in Go

To measure the performance, we often measure the time required to execute some function. Because most functions are fast, it can be difficult to precisely measure the time that a function takes if we run it just once. Instead, we can run the function many times and record the total time. We can then divide the total time by the number of executions. It can be difficult to decide how many times we should execute the function: it depends in part on how fast the function is. If a function takes 6 seconds to run, we may not want or need to run it too often. An easier strategy is to specify a minimum duration and repeatedly call the function until we reach or exceed that minimum duration.
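
The following is a minimal sketch of this strategy, without the testing package; the work function is a placeholder invented for the illustration. We call it repeatedly until a minimum duration has elapsed, then divide the total time by the number of calls:

package main

import (
    "fmt"
    "time"
)

// work is a placeholder function whose execution time we want to measure.
func work() int {
    s := 0
    for i := 0; i < 1000; i++ {
        s += i
    }
    return s
}

func main() {
    const minDuration = time.Second
    start := time.Now()
    count := 0
    sink := 0
    for time.Since(start) < minDuration {
        sink += work() // accumulate the result so the call is not optimized away
        count++
    }
    elapsed := time.Since(start)
    fmt.Println("iterations:", count)
    fmt.Println("ns/op:", float64(elapsed.Nanoseconds())/float64(count))
    _ = sink
}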

When the function has a short execution time, we often call the benchmark a microbenchmark. We use microbenchmarks to compare different implementations of the same functionality or to better understand the system or the problem. We should always keep in mind that a microbenchmark alone cannot be used to justify a software optimization. Real-world performance depends on multiple factors that are difficult to represent in a microbenchmark.

Importantly, all benchmarks are affected by measurement errors, and by interference from the system. To make matters worse, the distribution of timings may not follow a normal distribution.

Most programming languages provide the ability to run benchmarks. In Go, the tools make it easy to write benchmarks: you can import the testing package and create a function whose name starts with Benchmark and which takes a single parameter of type *testing.B. For example, the following program benchmarks the time required to compute the factorial of 10 as an integer:

package main

import (
    "fmt"
    "testing"
)

var fact int

func BenchmarkFactorial(b *testing.B) {
    for n := 0; n < b.N; n++ {
        fact = 1
        for i := 1; i <= 10; i++ {
            fact *= i
        }
    }
}

func main() {
    res := testing.Benchmark(BenchmarkFactorial)
    fmt.Println("BenchmarkFactorial", res)
}

If you put functions with such a signature (BenchmarkSomething(b *testing.B)) as part of your tests in a project, you can run them with the command go test -bench . (the trailing dot is a pattern that matches every benchmark). To run just one of them, you can specify a pattern such as go test -bench Factorial, which would only run benchmark functions whose name contains the word Factorial.

The b.N field indicates how many times the benchmark function runs. The testing package adjusts this value by increasing it until the benchmark runs for at least one second.
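
If one second is not long enough for your purposes, the go test command accepts a -benchtime flag to request a longer minimum duration, or a fixed number of iterations. The benchmark name and values below are arbitrary:

go test -bench Factorial -benchtime=5s
go test -bench Factorial -benchtime=200x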

Measuring memory allocations

In Go, each function has its own ‘stack memory’. As the name suggests, stack memory is allocated and deallocated in a last-in, first-out (LIFO) order. This memory is typically only usable within the function, and it is often limited in size. The other type of memory that a Go program may use is heap memory. Heap memory is allocated and deallocated in a random order. There is only one heap shared by all functions.

With the stack memory, there is no risk that the memory may get lost or misused since it belongs to a specific function and can be reclaimed at the end of the function. Heap memory is more of a problem: it is sometimes unclear when the memory should be reclaimed. Programming languages like Go rely on a garbage collector to solve this problem. For example, when we create a new slice with the make function, we do not need to worry about reclaiming the memory. Go automatically reclaims it. However, it may still be bad for performance to constantly allocate and deallocate memory. In many real-world systems, memory management becomes a performance bottleneck.

Thus it is sometimes useful to include the memory usage as part of the benchmark. The Go testing package allows you to measure the number of heap allocations made. Typically, in Go, this roughly corresponds to the number of calls to make and to the number of objects that the garbage collector must handle. The following extended program computes the factorial by storing its computation in dynamically allocated slices:

package main

import (
    "fmt"
    "testing"
)

var fact int

func BenchmarkFactorial(b *testing.B) {
    for n := 0; n < b.N; n++ {
        fact = 1
        for i := 1; i <= 10; i++ {
            fact *= i
        }
    }
}
func BenchmarkFactorialBuffer(b *testing.B) {
    for n := 0; n < b.N; n++ {
        buffer := make([]int, 11)
        buffer[0] = 1
        for i := 1; i <= 10; i++ {
            buffer[i] = i * buffer[i-1]
        }
    }
    b.ReportAllocs()
}

func BenchmarkFactorialBufferLarge(b *testing.B) {
    for n := 0; n < b.N; n++ {
        buffer := make([]int, 100001)
        buffer[0] = 1
        for i := 1; i <= 100000; i++ {
            buffer[i] = i * buffer[i-1]
        }
    }
    b.ReportAllocs()
}

func main() {
    res := testing.Benchmark(BenchmarkFactorial)
    fmt.Println("BenchmarkFactorial", res)
    resmem := testing.Benchmark(BenchmarkFactorialBuffer)
    fmt.Println("BenchmarkFactorialBuffer", resmem, resmem.MemString())
    resmem = testing.Benchmark(BenchmarkFactorialBufferLarge)
    fmt.Println("BenchmarkFactorialBufferLarge", resmem, resmem.MemString())
}

If you run such a Go program, you might get the following result:

BenchmarkFactorial 90887572             14.10 ns/op
BenchmarkFactorialBuffer 88609930               11.96 ns/op        0 B/op              0 allocs/op
BenchmarkFactorialBufferLarge     4408      249263 ns/op   802816 B/op         1 allocs/op

The last function allocates 802816 bytes per operation, unlike the first two. If Go determines that the data is not referenced after the function returns (a process called ‘escape analysis’), and if the amount of memory used is sufficiently small, it avoids allocating the memory on the heap, preferring stack memory instead. In the case of the last function, the memory usage is too high, hence the allocation happens on the heap rather than the stack.
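
The following minimal sketch (the function names are made up for the illustration) lets you observe escape analysis directly: compiling it with go build -gcflags=-m should report that the slice in sumLocal does not escape, while the slice returned by makeBuffer escapes to the heap:

package main

import "fmt"

// The slice is only used inside the function: it can live on the stack.
func sumLocal() int {
    buffer := make([]int, 16)
    s := 0
    for i := range buffer {
        buffer[i] = i
        s += buffer[i]
    }
    return s
}

// The slice is returned to the caller, so it must be allocated on the heap.
func makeBuffer() []int {
    return make([]int, 16)
}

var kept []int

func main() {
    kept = makeBuffer()
    fmt.Println(sumLocal(), len(kept))
}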

Measuring memory usage

Your operating system provides memory to a running process in units of pages. The operating system cannot provide memory in smaller units than a page. Thus if you allocate memory in a program, it may either cost no additional memory if there are enough pages already; or it may force the operating system to provide more pages.

The size of a page depends on the operating system and its configuration. It can often vary between 4 kilobytes and 16 kilobytes although much larger pages are also possible (e.g., 1 gigabyte).

A page is a contiguous array of virtual memory addresses. A page may also represent actual physical memory. However, operating systems tend to only map used pages to physical memory. An operating system may provide a nearly endless supply of pages to a process, without ever mapping them to physical memory. Thus it is not simple to ask how much memory a program uses. A program may appear to use a lot of (virtual) memory, while not using much physical memory, and conversely.

The page size impacts both the performance and the memory usage. Allocating pages to a process is not free; it takes some effort. Among other things, the operating system cannot just reuse a memory page from another process as is. Doing so would be a security threat, because you could have indirect access to the data stored in memory by another process. This other process could have held your passwords or other sensitive information in memory. Typically, an operating system has to initialize (e.g., set to zero) a newly assigned page. Furthermore, mapping the pages to their actual physical memory also takes time. To accelerate the mapping process, modern systems keep recently used mappings in a translation lookaside buffer. When the translation lookaside buffer runs out of space, a more expensive computation of the page mapping may be required. Thus large pages may improve the performance of some programs. However, large pages force the operating system to provide memory in larger chunks to a process, potentially wasting precious memory. You can write a Go program which prints out the page size of your system:

package main

import (
    "fmt"
    "os"
)

func main() {
    pageSize := os.Getpagesize()
    fmt.Printf("Page size: %d bytes (%d KB)\n", pageSize, pageSize/1024)
}

Go makes it relatively easy to measure the number of pages allocated to a program by the operating system. Nevertheless, some care is needed. Because Go uses a garbage collector to free allocated memory, there might be a delay between the moment you no longer need some memory and the moment it is actually freed. You may force Go to run the garbage collector immediately with the function call runtime.GC(). You should rarely invoke the garbage collector deliberately in practice, but for our purposes (measuring memory usage), it is useful.

There are several memory metrics. In Go, some of the most useful are HeapSys and HeapAlloc. The first indicates how much memory (in bytes) has been given to the program by the operating system. The second value, which is typically lower, indicates how much of that memory is actively in use by the program.

The following program allocates ever larger slices, and then ever smaller slices. In theory, the memory usage should first go up, and then go down:

package main

import (
    "fmt"
    "os"
    "runtime"
)

func main() {
    pageSize := os.Getpagesize()
    var m runtime.MemStats
    runtime.GC()
    runtime.ReadMemStats(&m)
    fmt.Printf(
        "HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
        float64(m.HeapSys)/1024.0/1024.0,
        float64(m.HeapAlloc)/1024.0/1024.0,
        float64(m.HeapSys)/float64(pageSize),
    )
    i := 100
    for ; i < 1000000000; i *= 10 {
        runtime.GC()
        s := make([]byte, i)
        runtime.ReadMemStats(&m)
        fmt.Printf(
            "%.3f MiB, HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
            float64(len(s))/1024.0/1024.0,
            float64(m.HeapSys)/1024.0/1024.0,
            float64(m.HeapAlloc)/1024.0/1024.0,
            float64(m.HeapSys)/float64(pageSize),
        )
    }
    for ; i >= 100; i /= 10 {
        runtime.GC()
        s := make([]byte, i)
        runtime.ReadMemStats(&m)
        fmt.Printf(
            "%.3f MiB, HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
            float64(len(s))/1024.0/1024.0,
            float64(m.HeapSys)/1024.0/1024.0,
            float64(m.HeapAlloc)/1024.0/1024.0,
            float64(m.HeapSys)/float64(pageSize),
        )
    }
    runtime.GC()
    runtime.ReadMemStats(&m)
    fmt.Printf(
        "HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
        float64(m.HeapSys)/1024.0/1024.0,
        float64(m.HeapAlloc)/1024.0/1024.0,
        float64(m.HeapSys)/float64(pageSize),
    )
}

The program calls os.Getpagesize() to get the underlying system’s memory page size in bytes as an integer, and assigns it to a variable pageSize. It declares a variable m of type runtime.MemStats, which is a struct that holds various statistics about the memory allocator and the garbage collector. The program repeatedly calls runtime.GC() to trigger a garbage collection cycle manually, which may free some memory and make it available for release. It calls runtime.ReadMemStats(&m) to populate the m variable with the current memory statistics. We can reuse the same variable m from call to call. The purpose of this program is to demonstrate how the memory usage of a Go program changes depending on the size and frequency of memory allocations and deallocations, and how the garbage collector and the runtime affect the memory release. The program prints the memory usage before and after each allocation, and shows how the m.HeapSys, m.HeapAlloc, and m.HeapSys / pageSize values grow and shrink accordingly.

If you run this program, you may observe that a program tends to hold on to the memory you have allocated and later released. It is partly a matter of optimization: acquiring memory takes time and we wish to avoid giving back pages only to soon request them again. It illustrates that it can be challenging to determine how much memory a program uses.

The program may print something like the following:

$ go run mem.go
HeapSys = 3.719 MiB, HeapAlloc =  0.367 MiB,  238.000 pages
0.000 MiB, HeapSys = 3.719 MiB, HeapAlloc =  0.367 MiB,  238.000 pages
0.001 MiB, HeapSys = 3.719 MiB, HeapAlloc =  0.383 MiB,  238.000 pages
0.010 MiB, HeapSys = 3.688 MiB, HeapAlloc =  0.414 MiB,  236.000 pages
0.095 MiB, HeapSys = 3.688 MiB, HeapAlloc =  0.477 MiB,  236.000 pages
0.954 MiB, HeapSys = 3.688 MiB, HeapAlloc =  1.336 MiB,  236.000 pages
9.537 MiB, HeapSys = 15.688 MiB, HeapAlloc =  9.914 MiB,  1004.000 pages
95.367 MiB, HeapSys = 111.688 MiB, HeapAlloc =  95.750 MiB,  7148.000 pages
953.674 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  954.055 MiB,  68332.000 pages
95.367 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  95.750 MiB,  68332.000 pages
9.537 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  9.914 MiB,  68332.000 pages
0.954 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  1.336 MiB,  68332.000 pages
0.095 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.477 MiB,  68332.000 pages
0.010 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.414 MiB,  68332.000 pages
0.001 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.383 MiB,  68332.000 pages
0.000 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.375 MiB,  68332.000 pages
HeapSys = 1067.688 MiB, HeapAlloc =  0.375 MiB,  68332.000 pages

Observe how, at the very beginning and at the very end, over a third of a megabyte of heap memory is reported as being in use. Furthermore, over 68,000 pages remain allocated to the program at the very end, even though no data structure remains in scope within the main function.
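
If you want the runtime to hand memory back to the operating system more eagerly, the runtime/debug package provides FreeOSMemory, which forces a garbage collection and attempts to return as much memory as possible to the operating system. The following is a minimal sketch; the 1 GiB allocation is an arbitrary choice:

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func report(label string) {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("%s: HeapSys = %.1f MiB, HeapReleased = %.1f MiB\n",
        label, float64(m.HeapSys)/1024/1024, float64(m.HeapReleased)/1024/1024)
}

func main() {
    s := make([]byte, 1<<30) // allocate 1 GiB (arbitrary size)
    s[0] = 1
    report("after allocation")
    s = nil              // drop the reference so the memory can be collected
    debug.FreeOSMemory() // collect and return memory to the operating system
    report("after FreeOSMemory")
}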

Inlining

One of the most powerful optimization techniques that a compiler may apply is function inlining: the compiler brings some of the called functions directly into the calling functions.

Go makes it easy to tell which functions are inlined. We can also easily request that the compiler does not inline a function by adding the line //go:noinline right before the function.

Let us consider this program, which contains two benchmarks where we sum all odd integers in a range.

package main

import (
    "fmt"
    "testing"
)

func IsOdd(i int) bool {
    return i%2 == 1
}

//go:noinline
func IsOddNoInline(i int) bool {
    return i%2 == 1
}

func BenchmarkCountOddInline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        sum := 0
        for i := 1; i < 1000; i++ {
            if IsOdd(i) {
                sum += i
            }
        }
    }
}

func BenchmarkCountOddNoinline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        sum := 0
        for i := 1; i < 1000; i++ {
            if IsOddNoInline(i) {
                sum += i
            }
        }
    }
}

func main() {
    res1 := testing.Benchmark(BenchmarkCountOddInline)
    fmt.Println("BenchmarkCountOddInline", res1)
    res2 := testing.Benchmark(BenchmarkCountOddNoinline)
    fmt.Println("BenchmarkCountOddNoinline", res2)
}

In Go, the flag -gcflags=-m tells the compiler to report the main optimizations it does. If you call this program simpleinline.go and compile it with the command go build -gcflags=-m simpleinline.go, you may see the following:

$ go build -gcflags=-m simpleinline.go
./simpleinline.go:8:6: can inline IsOdd
./simpleinline.go:21:12: inlining call to IsOdd
...

If you run the benchmark, you should see that the inlined version is much faster:

$ go run simpleinline.go
BenchmarkCountOddInline  3716786           294.6 ns/op
BenchmarkCountOddNoinline  1388792         864.8 ns/op

Inlining is not always beneficial: in some instances, it can generate large binaries and it may even slow down the software. However, when it is applicable, it can have a large beneficial effect.

Go tries as hard as possible to inline functions, but it has limitations. For example, compilers often find it difficult to inline recursive functions. Let us benchmark two factorial functions, one that is recursive, and one that is not.

package main

import (
    "fmt"
    "testing"
)

var array = make([]int, 1000)

func Factorial(n int) int {
    if n < 0 {
        return 0
    }
    if n == 0 {
        return 1
    }
    return n * Factorial(n-1)
}

func FactorialLoop(n int) int {
    result := 1
    for i := 1; i <= n; i++ {
        result *= i
    }
    return result
}

func BenchmarkFillNoinline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        for i := 1; i < 1000; i++ {
            array[i] = Factorial(i)
        }
    }
}

func BenchmarkFillInline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        for i := 1; i < 1000; i++ {
            array[i] = FactorialLoop(i)
        }
    }
}

func main() {
    res1 := testing.Benchmark(BenchmarkFillNoinline)
    fmt.Println("BenchmarkFillNoinline", res1)
    res2 := testing.Benchmark(BenchmarkFillInline)
    fmt.Println("BenchmarkFillInline", res2)
    fmt.Println(float64(res1.NsPerOp()) / float64(res2.NsPerOp()))
}

Though FactorialLoop and Factorial are equivalent, running this program, you should find that the non-recursive function (FactorialLoop) is much faster.

Cache line

Our computers read and write memory using small blocks of memory called “cache lines”. The cache line size is usually fixed and small (e.g., 64 or 128 bytes). To attempt to measure the cache-line size, we may use a strided copy: from a large array, we copy one byte out of every N bytes to another large array, and we repeat this process N times. Thus, if the original array contains 1024 bytes, we always copy a total of 1024 bytes, whether N = 1, N = 2, N = 4, or N = 8.

When N is sufficiently large (say N = 16), the problem should be essentially memory bound: the performance is not limited by the number of instructions, but by the system’s ability to load and store cache lines. If N is larger than twice the cache line, then we can effectively skip one cache line out of two. If N is smaller than the cache line, then every cache line must be accessed. You expect a large stride to be significantly faster.

One limitation to this approach is that processors may fetch more cache lines than needed so we may overestimate the size of the cache line. However, unless memory bandwidth is overly abundant, we should expect processors to try to limit the number of cache lines fetched.

Let us run an experiment. For each stride size, we repeat 10 times and record the maximum, the minimum and the average. Consider the following program.

package main

import (
    "fmt"
    "time"
)

const size = 33554432 // 32 MB
func Cpy(arr1 []uint8, arr2 []uint8, slice int) {
    for i := 0; i < len(arr1); i += slice {
        arr2[i] = arr1[i]
    }
}

func AverageMinMax(f func() float64) (float64, float64, float64) {
    var sum float64
    var minimum float64
    var maximum float64

    for i := 0; i < 10; i++ {
        arr1 = make([]uint8, size)
        arr2 = make([]uint8, size)

        v := f()
        sum += v
        if i == 0 || v < minimum {
            minimum = v
        }
        if i == 0 || v > maximum {
            maximum = v
        }
    }
    return sum / 10, minimum, maximum
}

var arr1 []uint8
var arr2 []uint8

func run(size int, slice int) float64 {
    start := time.Now()
    times := 10
    for i := 0; i < times*slice; i++ {
        Cpy(arr1, arr2, slice)
    }
    end := time.Now()
    dur := float64(end.Sub(start)) / float64(times*slice)
    return dur
}

func main() {
    for slice := 16; slice <= 4096; slice *= 2 {
        a, m, M := AverageMinMax(func() float64 { return run(size, slice-1) })
        fmt.Printf("%10d: %10.1f GB/s [%4.1f - %4.1f]\n", slice-1, float64(size)/a, float64(size)/M, float64(size)/m)
    }
}

We may get the following result:

$ go run cacheline.go
        15:       23.6 GB/s [21.3 - 24.4]
        31:       24.3 GB/s [23.8 - 24.5]
        63:       24.2 GB/s [23.6 - 24.6]
       127:       26.9 GB/s [23.8 - 27.9]
       255:       40.8 GB/s [37.8 - 43.6]
       511:      162.0 GB/s [130.4 - 203.4]
      1023:      710.0 GB/s [652.0 - 744.4]
      2047:      976.1 GB/s [967.1 - 983.8]
      4095:     1247.4 GB/s [1147.7 - 1267.0]

We see that the performance increases substantially when the stride goes from 127 to 255. It suggests that the cache line has 128 bytes. If you run this same benchmark on your own system, you may get a different result.

The results need to be interpreted with care: we are not measuring a copy speed of 1247.4 GB/s. Rather, we can copy large arrays at such a speed if we only copy one byte out of every 4095 bytes.

CPU Cache

When programming, we often do not think directly about memory. When we do consider that our data uses memory, we often think of it as homogeneous: memory is like a large uniform canvas upon which the computer writes and reads its data. However, your main memory (RAM) is typically buffered using a small amount of memory that resides close to the processor core (the CPU cache). We often have several layers of cache memory (e.g., L1, L2, L3): L1 is typically small but very fast whereas, for example, L3 is larger but slower.

You can empirically measure the effect of the cache. If you take a small array and shuffle it randomly, you will be moving data primarily within the CPU cache, which is fast. If you take a larger array, you will move data in memory without much help from the cache, a process that is much slower. Thus shuffling ever larger arrays is a way to estimate the size of your cache. It may prove difficult to tell exactly how many layers of cache you have and how large each layer is. However, you can usually tell when your array is significantly larger than the CPU cache.

We are going to write a random shuffle function: Shuffle(arr []uint32). It uses an algorithm called the Fisher-Yates shuffle, which involves going through the array in reverse and swapping each element with another randomly chosen from those preceding it. The function uses a seed variable to generate random numbers from a mathematical formula. For our purposes, we use a simplistic number generator: we multiply the seed by the index. The function bits.Mul64 calculates the product of two 64-bit numbers and returns the result as two 64-bit halves: the most significant (hi) and the least significant. The most significant half is necessarily between 0 and i (inclusively). We use this most significant half as the random index. The function then exchanges the elements using multiple assignment. We call this shuffle function several times, on inputs of different sizes. We report the time normalized by the size of the input.

package main

import (
    "fmt"
    "math/bits"
    "time"
)

func Shuffle(arr []uint32) {
    seed := uint64(1234)
    for i := len(arr) - 1; i > 0; i-- {
        seed += 0x9E3779B97F4A7C15
        hi, _ := bits.Mul64(seed, uint64(i+1))
        j := int(hi)
        arr[i], arr[j] = arr[j], arr[i]
    }
}

func AverageMinMax(f func() float64) (float64, float64, float64) {
    var sum float64
    var minimum float64
    var maximum float64

    for i := 0; i < 10; i++ {
        v := f()
        sum += v
        if i == 0 || v < minimum {
            minimum = v
        }
        if i == 0 || v > maximum {
            maximum = v
        }
    }
    return sum / 10, minimum, maximum
}

func run(size int) float64 {
    arr := make([]uint32, size)

    for i := range arr {
        arr[i] = uint32(i + 1)
    }
    start := time.Now()
    end := time.Now()
    times := 0
    for ; end.Sub(start) < 100_000_000; times++ {
        Shuffle(arr)
        end = time.Now()
    }
    dur := float64(end.Sub(start)) / float64(times)
    return dur / float64(size)
}

func main() {
    for size := 4096; size <= 33554432; size *= 2 {
        fmt.Printf("%20d KB ", size/1024*4)
        a, m, M := AverageMinMax(func() float64 { return run(size) })
        fmt.Printf(" %.2f [%.2f, %.2f]\n", a, m, M)
    }
}

A possible output of running this program might be:

$ go run cache.go
                  16 KB  0.70 [0.66, 0.93]
                  32 KB  0.65 [0.64, 0.66]
                  64 KB  0.64 [0.64, 0.66]
                 128 KB  0.64 [0.64, 0.67]
                 256 KB  0.65 [0.64, 0.66]
                 512 KB  0.70 [0.70, 0.71]
                1024 KB  0.77 [0.76, 0.79]
                2048 KB  0.83 [0.82, 0.84]
                4096 KB  0.87 [0.86, 0.90]
                8192 KB  0.92 [0.91, 0.95]
               16384 KB  1.10 [1.06, 1.24]
               32768 KB  2.34 [2.28, 2.52]
               65536 KB  3.90 [3.70, 4.25]
              131072 KB  5.66 [4.80, 9.78]

We see that between 16 KB and 16384 KB, the time per element shuffled does not increase much, even though we repeatedly double the input size. However, between 16384 KB and 32768 KB, the time per element roughly doubles, and it keeps increasing as the array size doubles further. It suggests that the size of the CPU cache is about 16384 KB in this instance.

Memory bandwidth

You can only read and write memory up to a maximal speed. Such limits can be difficult to measure. In particular, you may need several cores in a multi-core system to achieve the best possible memory bandwidth. For simplicity, let us consider the maximal read bandwidth.

Many large systems do not have a single bandwidth number. For example, many large systems rely on NUMA: NUMA stands for Non-Uniform Memory Access. In a NUMA system, each processor has its own local memory, which it can access faster than the memory of other processors.

The bandwidth also depends to some extent on the amount of memory requested. If the memory fits in the CPU cache, only the first access may be expensive. A very large memory region may not fit in RAM and may require disk storage. Even if it fits in RAM, an overly large memory region might require many memory pages, and accessing all of them may cause page walking due to the limits of the translation lookaside buffer.

If the memory is accessed at random locations, it might be difficult for the system to sustain a maximal bandwidth because the system cannot predict easily where the next memory load occurs. To get the best bandwidth, you may want to access the memory linearly or according to some predictable pattern.

Let us consider the following code:

package main

import (
    "fmt"
    "time"
)

func run() float64 {
    bestbandwidth := 0.0
    arr := make([]uint8, 2*1024*1024*1024) // 2 GB
    for i := 0; i < len(arr); i++ {
        arr[i] = 1
    }
    for t := 0; t < 20; t++ {
        start := time.Now()
        acc := 0
        for i := 0; i < len(arr); i += 64 {
            acc += int(arr[i])
        }
        end := time.Now()
        if acc != len(arr)/64 {
            panic("!!!")
        }
        bandwidth := float64(len(arr)) / end.Sub(start).Seconds() / 1024 / 1024 / 1024
        if bandwidth > bestbandwidth {
            bestbandwidth = bandwidth
        }
    }
    return bestbandwidth
}

func main() {
    for i := 0; i < 10; i++ {
        fmt.Printf(" %.2f GB/s\n", run())
    }
}

The code defines two functions: run and main. The main function is the entry point for the program, and it calls the run function 10 times, printing the result each time. The run function is a custom function that measures the memory bandwidth of the system. It does this by performing the following steps:

  • It declares a variable called bestbandwidth and initializes it to 0.0. This variable stores the highest bandwidth value observed during the execution of the function.
  • It creates a slice of bytes (uint8) called arr with a length of 2 GB, and initializes it with 1s.
  • It scans the slice, reading only every 64th element and skipping the rest. Given that most systems have a cache-line size of 64 bytes or more, this is enough to touch each cache line.
  • It computes the bandwidth by dividing the size of the slice (in bytes) by the difference between the end and start times (in seconds), and then dividing by 1024 three times to convert the result to gigabytes per second (GB/s).
  • It repeats the measurement 20 times and returns the best result, to account for possible variations in the system performance.

The main function prints the result 10 times, to show the consistency of the measurement.
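
As noted earlier, a single core may not be able to saturate the memory subsystem. The following sketch (not part of the original benchmark) runs the same kind of strided scan from several goroutines, each over its own 1 GB array, and reports the aggregate bandwidth; the choice of four goroutines is arbitrary:

package main

import (
    "fmt"
    "sync"
    "time"
)

// scan reads one byte per 64-byte cache line and returns the sum.
func scan(arr []uint8) int {
    acc := 0
    for i := 0; i < len(arr); i += 64 {
        acc += int(arr[i])
    }
    return acc
}

func main() {
    const workers = 4               // arbitrary number of goroutines
    const size = 1024 * 1024 * 1024 // 1 GB per goroutine
    arrays := make([][]uint8, workers)
    for i := range arrays {
        arrays[i] = make([]uint8, size)
        for j := range arrays[i] {
            arrays[i][j] = 1
        }
    }
    var wg sync.WaitGroup
    start := time.Now()
    for _, arr := range arrays {
        wg.Add(1)
        go func(a []uint8) {
            defer wg.Done()
            if scan(a) != len(a)/64 {
                panic("unexpected sum")
            }
        }(arr)
    }
    wg.Wait()
    seconds := time.Since(start).Seconds()
    total := float64(workers) * float64(size)
    fmt.Printf("aggregate bandwidth: %.2f GB/s\n", total/seconds/1024/1024/1024)
}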

Memory latency and parallelism

Latency is often described as the time delay between the beginning of a request and the moment when you are served. Thus, if you go to a restaurant, the latency you might be interested in is the time it takes before you can start eating. The latency is distinct from the throughput: a restaurant might be able to serve hundreds of customers at once, but still have high latency (long delays for each customer). If you put a lot of data on a very large disk, you can put this disk in a truck and drive the truck between two cities. It could represent a large bandwidth (much data is moved per unit of time), but the latency could be quite poor (hours). Similarly, you could shine a laser at your partner when supper is ready: the information could arrive without much delay even if you are very far away, but you are communicating little information (low throughput).

One way to express this trade-off between latency and throughput is with Little’s Law: L = λW, where L is the average number of elements in the system, λ is the throughput (long-term average arrival rate of new elements), and W is the latency, or the average amount of time that elements spend waiting. Thus, if you want to have L customers at all times in your restaurant, and fewer customers arrive, you should serve the customers with greater delays. Little’s Law works for our memory subsystems as well: computers can sustain a maximum number of in-flight memory requests, each memory request has a latency, and there is an overall bandwidth. For example, if each memory request takes 100 ns and the system can sustain 50 concurrent requests, Little’s Law gives a throughput of 50 requests per 100 ns, or 500 million requests per second. If latency does not improve, we can still improve bandwidth or throughput by increasing the number of requests that can be sustained concurrently. Unfortunately, system designers are often forced to make this choice, and so it is not uncommon to see stagnant or worsening memory latencies despite fast-improving memory bandwidths.

A common illustration of the concept of memory latency is the traversal of a linked list. In computer science, a linked list is a data structure made of nodes, and each node is linked (by a pointer) to the next node. The nodes may not be laid out consecutively in memory, but even if they are, accessing each successive node requires at least a small delay. On current processors, it can often take at least 3 cycles to load data from memory, even if the memory is in cache. Thus, determining the length of the list by traversing the whole linked list can take time, and most of this time is just the sum of the successive delays. The following code benchmarks the time required to traverse a linked list made of a million nodes. Though the time varies depending on your system, it may represent a sizeable fraction of a millisecond.

package main

import (
    "fmt"
    "testing"
)

type Node struct {
    data int
    next *Node
}

func build(volume int) *Node {
    var head *Node
    for i := 0; i < volume; i++ {
        head = &Node{i, head}
    }
    return head
}

var list *Node
var N int

func BenchmarkLen(b *testing.B) {
    for n := 0; n < b.N; n++ {
        len := 0
        for p := list; p != nil; p = p.next {
            len++
        }
        if len != N {
            b.Fatalf("invalid length: %d", len)
        }
    }
}

func main() {
    N = 1000000
    list = build(N)
    res := testing.Benchmark(BenchmarkLen)
    fmt.Println("milliseconds: ", float64(res.NsPerOp())/1e6)

    fmt.Println("nanoseconds per el.", float64(res.NsPerOp())/float64(N))
}

In this code, a Node struct is defined with two fields: data is an integer representing the value stored in the node, and next is a pointer to the next node in the linked list. We could also add a pointer to the previous node, but that is not necessary in our case.

The build function creates a singly linked list of nodes from an integer volume given as an argument. It starts with an empty linked list (head is initially nil). It iterates from 0 to volume-1, creating a new node with value i and pointing its next field to the current head. The new node becomes the new head. The function returns the final head of the linked list. The main function initializes two global variables (list and N) storing respectively the head of the list and the expected length. These values are used by the BenchmarkLen function.

This code demonstrates how to create a linked list, compute its length, and benchmark the performance of the length computation. Our length computation is almost entirely bound (limited) by the memory latency, the time it takes to access the memory. The computations that we are doing (comparisons, increments) are unimportant to the performance. To illustrate this observation, we can try traversing two linked lists simultaneously, as in this example:

package main

import (
    "fmt"
    "testing"
)

type Node struct {
    data int
    next *Node
}

func build(volume int) *Node {
    var head *Node
    for i := 0; i < volume; i++ {
        head = &Node{i, head}
    }
    return head
}

var list1 *Node
var list2 *Node

var N int

func BenchmarkLen(b *testing.B) {
    for n := 0; n < b.N; n++ {
        len := 0
        for p1, p2 := list1, list2; p1 != nil && p2 != nil; p1, p2 = p1.next, p2.next {
            len++
        }
        if len != N {
            b.Fatalf("invalid length: %d", len)
        }
    }
}

func main() {
    N = 1000000
    list1 = build(N)
    list2 = build(N)

    res := testing.Benchmark(BenchmarkLen)
    fmt.Println("milliseconds: ", float64(res.NsPerOp())/1e6)

    fmt.Println("nanoseconds per el.", float64(res.NsPerOp())/float64(N))
}

If you run this new code, you might find that the benchmark results are close to the single-list ones. It is not surprising: the processor is mostly just waiting for the next node, and waiting for two nodes is not much more expensive. For this reason, when programming, you should limit memory accesses as much as possible. Use simple arrays when you can instead of linked lists or node-based tree structures.

We would like to work with arbitrarily large data structures, so that we can stress memory accesses outside of the cache. Sattolo’s algorithm is a variant of the well-known random shuffle that generates a random cyclic permutation of an array or list. Sattolo’s algorithm ensures that the data is permuted using a single cycle. That is, starting with one element in a list of size n, we find that this element is moved to another position, which is itself moved to another position, and so forth, until after n moves, we end up back at our initial position. To apply Sattolo’s algorithm, given an array or list of elements, we go through the indexes i from 0 to n-2, where n is the length of the array. For each index i, we choose a random index j such that i < j < n, and we swap the elements at indices i and j. For example, given the array [0, 1, 2, 3, 4], the algorithm might produce the cyclic permutation [2, 3, 4, 0, 1]: starting at index 0, we visit indexes 2, 4, 1, and 3 before coming back to index 0.

With such a permuted array, we can visit all values exactly once in random order. From an array containing the indexes 0 to n-1 permuted with Sattolo’s algorithm, we load the first element, read its value, move to the corresponding index, and so forth. After n operations, we come back to the initial position. Because each operation involves a memory load, this traversal is limited by memory latency. We can try to go faster with memory-level parallelism: we can pick k positions spread out in the cycle and move from these k initial positions n/k times through the cycle. Because computers can load many values in parallel, this algorithm should be faster for larger values of k. However, as k increases, we may see fewer and fewer gains because systems have limited memory-level parallelism and bandwidth. The following program implements this idea.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// makeCycle creates a cycle of a specified length starting at element 0
func makeCycle(length int) ([]uint64, []uint64) {
    array := make([]uint64, length)
    index := make([]uint64, length)
    // Create a cycle of maximum length within the big array
    for i := 0; i < length; i++ {
        array[i] = uint64(i)
    }

    // Sattolo shuffle
    for i := 0; i+1 < length; i++ {
        swapIdx := rand.Intn(length-i-1) + i + 1
        array[i], array[swapIdx] = array[swapIdx], array[i]
    }

    total := 0
    cur := uint64(0)
    // walk the cycle once, recording the visiting order in index
    for {
        index[total] = cur
        total++
        cur = array[cur]
        if cur == 0 {
            break
        }
    }
    return array, index
}

// setupPointers sets up pointers based on the given index
func setupPointers(index []uint64, length, mlp int) []uint64 {
    sp := make([]uint64, mlp)
    sp[0] = 0

    totalInc := 0
    for m := 1; m < mlp; m++ {
        totalInc += length / mlp
        sp[m] = index[totalInc]
    }
    return sp
}

func runBench(array []uint64, index []uint64, mlp int) time.Duration {
    length := len(array)
    sp := setupPointers(index, length, mlp)
    hits := length / mlp
    before := time.Now()
    for i := 0; i < hits; i++ {
        for m := 0; m < mlp; m++ {
            sp[m] = array[sp[m]]
        }
    }
    after := time.Now()
    return after.Sub(before)
}

func main() {
    const length = 100000000
    array, index := makeCycle(length)
    fmt.Println("Length:", length*8/1024/1024, "MB")
    base := runBench(array, index, 1)
    fmt.Println("Lanes:", 1, "Time:", base)

    for mlp := 2; mlp <= 40; mlp++ {
        t := runBench(array, index, mlp)
        fmt.Println("Lanes:", mlp, "Speedup:", fmt.Sprintf("%.1f", float64(base)/float64(t)))
    }
}

The function makeCycle creates a cycle of a specified length starting at element 0. It initializes two slices, array and index, both of type []uint64. The array slice represents the elements of the cycle, while the index slice stores the order in which the cycle visits the positions, so that we can easily pick a starting position anywhere in the cycle. The function initializes array with the values 0 to length-1, applies Sattolo’s shuffle to create a random cyclic permutation, walks the cycle once to fill index, and returns both slices.

The function setupPointers computes an increment (totalInc) from the length and the number of lanes (mlp), and assigns to sp starting positions taken from index at regular intervals along the cycle. The function runBench benchmarks the execution time for a given number of lanes (mlp): it initializes the slice sp using setupPointers, repeatedly advances each pointer by following the indices stored in array, and returns the elapsed time as a time.Duration value. The main function first measures the running time with a single lane, and then reports the speedup obtained with multiple lanes.

Overall, this code generates a cycle of a specified length, sets up pointers, and benchmarks the execution time for different numbers of lanes. The primary purpose is to explore memory-level parallelism: runBench interleaves several independent pointer chases so that the processor can keep several memory loads in flight at once. The speedup is calculated by comparing the execution time for different numbers of lanes; the larger the speedup, the more memory-level parallelism the system can exploit. The general principle is that you can often improve the performance of a system that faces high latencies by breaking data dependencies. Instead of putting all your data in one long chain, try to have no chain at all or, if you must have chains, use several smaller chains.

Superscalarity and data dependency

Most current processors are superscalar (as opposed to ‘scalar’), meaning that they can execute and retire several instructions per CPU cycle. That is, even if you have a single CPU core, there is much parallelism involved. Some processors can retire 8 instructions per cycle or more. Not all code benefits equally from superscalar execution. Several factors can limit your processor to just a few instructions per cycle. Having to wait on memory accesses is one such factor. Another common factor is data dependency: when the next instruction depends on the result of a preceding instruction, it may have to wait before it starts executing.

To illustrate, consider functions that compute successive differences between elements of an array (e.g., given 5, 7, 6, you might get the initial value 5 followed by 2 and -1), and the reverse operation, which sums up all the differences to recover the original values. You may implement these functions as such:

func successiveDifferences(arr []int) {
    base := arr[0]
    for i := 1; i < len(arr); i++ {
        base, arr[i] = arr[i], arr[i]-base
    }
}

func prefixSum(arr []int) {
    for i := 1; i < len(arr); i++ {
        arr[i] = arr[i] + arr[i-1]
    }
}

Assuming that the compiler does not optimize these functions in a non-trivial manner (e.g., using SIMD instructions), we can reason relatively simply about the performance. For the successive differences, we need approximately one subtraction per element in the array. For the prefix sum, you need approximately one addition per element in the array. It looks quite similar at a glance. However, the data dependency is different. To compute the difference between any two values in the array, you do not need to have computed the preceding differences. However, the prefix sum, as we implemented it, requires us to have computed all preceding sums before the next can be computed. Let us write a small benchmarking program to test the performance difference:

package main

import (
    "fmt"
    "math/rand"
    "testing"
)

func successiveDifferences(arr []int) {
    base := arr[0]
    for i := 1; i < len(arr); i++ {
        base, arr[i] = arr[i], arr[i]-base
    }
}

func prefixSum(arr []int) {
    for i := 1; i < len(arr); i++ {
        arr[i] = arr[i] + arr[i-1]
    }
}

var array []int

func BenchmarkPrefixSum(b *testing.B) {
    for n := 0; n < b.N; n++ {
        prefixSum(array)
    }
}

func BenchmarkSuccessiveDifferences(b *testing.B) {
    for n := 0; n < b.N; n++ {
        successiveDifferences(array)
    }
}

func main() {
    array = make([]int, 100)
    for i := range array {
        array[i] = rand.Int()
    }
    res2 := testing.Benchmark(BenchmarkSuccessiveDifferences)
    fmt.Println("BenchmarkSuccessiveDifferences", res2)
    res1 := testing.Benchmark(BenchmarkPrefixSum)
    fmt.Println("BenchmarkPrefixSum", res1)

}

Your results will vary depending on your system. However, you should not be surprised if the prefix sum takes more time. On an Apple system, we got the following results:

BenchmarkSuccessiveDifferences 39742334         30.04 ns/op
BenchmarkPrefixSum  8307944            142.8 ns/op

The prefix sum can be several times slower, even though it appears at a glance that it should use a comparable number of instructions. In general, you cannot trust a hasty analysis. Just because two functions appear to do a similar amount of work does not mean that they have the same performance. Several factors must be taken into account, including data dependencies.
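
One way to act on this observation is to break the dependency chain. The following sketch (not from the original text) sums a slice of floating-point numbers with a single accumulator and with two independent accumulators. Because the Go compiler does not reorder floating-point additions, the two-accumulator version gives the processor two shorter dependency chains and is often faster, though the result may differ slightly due to rounding and your results will vary:

package main

import (
    "fmt"
    "math/rand"
    "testing"
)

var values []float64
var sink float64

// sumOne uses a single accumulator: each addition depends on the previous one.
func sumOne(arr []float64) float64 {
    s := 0.0
    for _, v := range arr {
        s += v
    }
    return s
}

// sumTwo uses two independent accumulators, forming two shorter dependency chains.
func sumTwo(arr []float64) float64 {
    s0, s1 := 0.0, 0.0
    for i := 0; i+1 < len(arr); i += 2 {
        s0 += arr[i]
        s1 += arr[i+1]
    }
    if len(arr)%2 == 1 {
        s0 += arr[len(arr)-1]
    }
    return s0 + s1
}

func BenchmarkSumOne(b *testing.B) {
    for n := 0; n < b.N; n++ {
        sink = sumOne(values)
    }
}

func BenchmarkSumTwo(b *testing.B) {
    for n := 0; n < b.N; n++ {
        sink = sumTwo(values)
    }
}

func main() {
    values = make([]float64, 10000)
    for i := range values {
        values[i] = rand.Float64()
    }
    fmt.Println("BenchmarkSumOne", testing.Benchmark(BenchmarkSumOne))
    fmt.Println("BenchmarkSumTwo", testing.Benchmark(BenchmarkSumTwo))
}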

Branch prediction

In part because processors are superscalar, they have been designed to execute speculatively: when facing a branch, the processor tries to guess the direction that will be taken, and it begins the computation optimistically. When the processor makes the correct prediction, it usually improves the performance, sometimes by a large amount. However, when the processor is unable to predict the branch accurately, branch prediction may become a net negative. Indeed, when the branch is mispredicted, the processor may have to restart the computation from the point where it made the wrong prediction, an expensive process that can waste several CPU cycles. To illustrate, let us first consider a function that copies the content of a slice into another slice of the same size:

func Copy(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] = v
    }
}

A more sophisticated function may copy only the odd elements:

func CopyOdd(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        if v&1 == 1 {
            dest[i] = v
        }
    }
}

We may try to copy an array that contains random integers (both odd and even), only odd integers, or only even integers. The following program illustrates:

package main

import (
    "fmt"
    "math/rand"
    "testing"
)

func Copy(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] = v
    }
}

func CopyOdd(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        if v&1 == 1 {
            dest[i] = v
        }
    }
}

var array []uint
var dest []uint

func BenchmarkCopyOdd(b *testing.B) {
    for n := 0; n < b.N; n++ {
        CopyOdd(dest, array)
    }
}

func BenchmarkCopy(b *testing.B) {
    for n := 0; n < b.N; n++ {
        Copy(dest, array)
    }
}

func main() {
    array = make([]uint, 10000)
    dest = make([]uint, len(array))

    for i := range array {
        array[i] = uint(rand.Uint32())
    }
    res0 := testing.Benchmark(BenchmarkCopy)
    fmt.Println("BenchmarkCopy (random)", res0)
    res1 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (random)", res1)
    for i := range array {
        array[i] = uint(rand.Uint32()) | 1
    }
    res2 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (odd data)", res2)
    for i := range array {
        array[i] = uint(rand.Uint32()) &^ 1
    }
    res3 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (even data)", res3)
}

On an Apple system, we got the following results:

BenchmarkCopy (random)   414158       2936 ns/op
BenchmarkCopyOdd (random)    55408           19518 ns/op
BenchmarkCopyOdd (odd data)   402670          2975 ns/op
BenchmarkCopyOdd (even data)   402738         2896 ns/op

The last three timings involve the same function; only the input data differs. We find that all timings are similar, except for the benchmark where CopyOdd receives random data: it is several times slower in our tests. The much longer running time is due to the presence of an unpredictable branch in our inner loop. Observe that the same function, subject to the same volume of data, can have vastly different performance characteristics, even though the computational complexity of the function does not change: in all instances, we have linear time complexity.

If we expect our data to lead to poor branch prediction, we may reduce the number of branches in the code. The resulting code might be nearly branch-free, or branchless. For example, we can use an arithmetic and logical expression to replace a conditional copy:

func CopyOddBranchless(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] ^= uint(-(v & 1)) & (v ^ dest[i])
    }
}

Let us review the complicated expression:

  • v & 1: This operation checks if the least significant bit of v is set (i.e., if v is odd).
  • -(v & 1): This negates the result of the previous operation. If v is odd, this becomes -1; otherwise, it becomes 0. However, -1 as an unsigned integer becomes the maximal value, the one with all of the bits set to 1.
  • v ^ dest[i]: This XORs the value of v with the corresponding element in the dest slice.
  • uint(-(v & 1)) & (v ^ dest[i]): If v is odd, it returns the XOR of v with dest[i]; otherwise, it returns 0.
  • Finally, dest[i] ^= uint(-(v & 1)) & (v ^ dest[i]) leaves dest[i] unchanged if v is even; otherwise, it replaces dest[i] with v, using the fact that dest[i] ^ (v ^ dest[i]) == v. A small check of this identity follows the list.
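
Here is a small, self-contained check of that identity; the values 4, 5, and 9 are arbitrary:

package main

import "fmt"

func main() {
    for _, v := range []uint{4, 5} {
        d := uint(9)
        d ^= uint(-(v & 1)) & (v ^ d)
        fmt.Println(v, "->", d) // 4 -> 9 (left unchanged), 5 -> 5 (copied)
    }
}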

We can put this function to good use in a benchmark:

package main

import (
    "fmt"
    "math/rand"
    "testing"
)

func CopyOdd(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        if v&1 == 1 {
            dest[i] = v
        }
    }
}

func CopyOddBranchless(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] ^= uint(-(v & 1)) & (v ^ dest[i])
    }
}

var array []uint
var dest []uint

func BenchmarkCopyOdd(b *testing.B) {
    for n := 0; n < b.N; n++ {
        CopyOdd(dest, array)
    }
}

func BenchmarkCopyOddBranchless(b *testing.B) {
    for n := 0; n < b.N; n++ {
        CopyOddBranchless(dest, array)
    }
}

func main() {
    array = make([]uint, 10000)
    dest = make([]uint, len(array))
    for i := range array {
        array[i] = uint(rand.Uint32())
    }
    res1 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (random)", res1)
    res2 := testing.Benchmark(BenchmarkCopyOddBranchless)
    fmt.Println("BenchmarkCopyOddBranchless (random)", res2)
}

On an Apple system, we got:

BenchmarkCopyOdd (random)    60782           19254 ns/op
BenchmarkCopyOddBranchless (random)   166863          7124 ns/op

In this test, the branchless approach is much faster. We should stress that branchless code is not always faster. In fact, in our overall results, the branchless function is significantly slower than the original when the branches are well predicted: 2896 ns/op for the branchy version on predictable data versus 7124 ns/op for the branchless version. In actual software, you should try to identify the poorly predicted branches and, in those cases, check whether a branchless approach is faster. Thankfully, in most projects, most branches are well predicted in practice.

How to read files quickly in JavaScript

Suppose you need to read several files on a server using JavaScript. There are many ways to read files in JavaScript with a runtime like Node.js. Which one is best? Let us consider the various approaches.

Using fs.promises

const fs = require('fs/promises');
const readFile = fs.readFile;
readFile("lipsum.txt", { encoding: 'utf-8' })
.then((data) => {...})
.catch((err) => {...})

Using fs.readFile and util.promisify

const fs = require('fs');
const util = require('util');
const readFile = util.promisify(fs.readFile);
readFile("lipsum.txt", { encoding: 'utf-8' })
.then((data) => {...})
.catch((err) => {...})

Using fs.readFileSync

const fs = require('fs');
const readFileSync = fs.readFileSync;
var data = readFileSync("lipsum.txt", { encoding: 'utf-8' })

Using await fs.readFileSync

const fs = require('fs');
const readFileSync = fs.readFileSync;
async function f(name, options) {
  return readFileSync(name, options);
}

Using fs.readFile

const fs = require('fs');
const readFile = fs.readFile;
fs.readFile('lipsum.txt', function read(err, data) {...});

Benchmark

I wrote a small benchmark where I repeatedly read a file from disk. It is a simple loop where the same file is accessed each time. I report the number of milliseconds needed to read the file 50,000 times. The file is relatively small (slightly over a kilobyte). I use a large server with dozens of Ice Lake Intel cores and plenty of memory. I use Node.js 20.1 and Bun 1.0.14; Bun is a competing JavaScript runtime.

I ran the benchmarks multiple times, and I report the best results in all cases. Your results will differ.

                                 time (Node.js)   time (Bun)
fs.promises                      2400 ms          110 ms
fs.readFile and util.promisify   1500 ms          180 ms
fs.readFileSync                  140 ms           140 ms
await fs.readFileSync            220 ms           180 ms
fs.readFile                      760 ms           90 ms

At least on my system, in this test, fs.promises is significantly more expensive than the alternatives when using Node.js. The Bun runtime is much faster than Node.js in this test.

The results for fs.promises are worse than they appear in the following sense: I find that readFileSync uses 300 ms of CPU time, whereas fs.promises uses 7 s of CPU time, because fs.promises triggers work on several cores during the benchmark.

Increasing the file size to, say, 32 kB does not change the conclusion. With significantly larger files, many of the Node.js cases fail with “heap limit Allocation failed”, whereas Bun keeps going. The results with Bun do not change the conclusion: fs.readFile is consistently faster in my tests, even for large files.

Credit. My benchmark is inspired by a test case provided by Evgenii Stulnikov.

How many political parties rule Canada? Fun with statistics

Canada has several political parties with elected members of parliament: the Liberals, the Conservatives, the Bloc Québécois, the NDP and the Greens. But do they behave as distinct political parties when voting, or are they somehow aligned?

Voting data for the members of parliament in Canada is easily accessible as JSON or XML. Thus I wrote a little Python script to compute, for each vote, what percentage of each party voted yea. I use the latest 394 votes. It turns out that, overwhelmingly, the percentage is either 0% or 100%. So individual members of parliament hardly matter; only the caucuses do.
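
For illustration, here is a minimal sketch of such a script. It is not the author’s actual code: it assumes the individual ballots have already been downloaded to a local JSON file (ballots.json) with hypothetical fields voteId, party and decision.

import json
from collections import defaultdict

# Minimal sketch, not the original script. We assume a local file
# "ballots.json" holding one record per member per vote, with
# hypothetical fields "voteId", "party" and "decision" ("Yea" or "Nay").
with open("ballots.json") as f:
    ballots = json.load(f)

yeas = defaultdict(lambda: defaultdict(int))
totals = defaultdict(lambda: defaultdict(int))
for b in ballots:
    totals[b["voteId"]][b["party"]] += 1
    if b["decision"] == "Yea":
        yeas[b["voteId"]][b["party"]] += 1

# Print, for each vote, the fraction of each party that voted yea.
for vote, parties in totals.items():
    print(vote, {p: yeas[vote][p] / n for p, n in parties.items()})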

We can first examine Pearson’s correlation between how the different parties vote:

                  Conserv.  Liberal  NDP   Bloc Québécois  Green Party
Conserv.          1         -0.5     -0.5  -0.1            -0.2
Liberal                     1        0.8   0.4             0.5
NDP                                  1     0.4             0.6
Bloc Québécois                             1               0.5
Green Party                                                1

We observe an excellent correlation between the ruling party (the Liberals) and the NDP (0.8), and, to a lesser extent, between the Liberals and the Bloc Québécois (0.4) and the Greens (0.5). The Conservatives are anti-correlated with everyone else, although less so with the Bloc Québécois and the Greens than with the Liberals and the NDP.
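
Such a matrix is easy to reproduce. Here is a minimal sketch with made-up toy values (not the actual voting records): arrange the yea percentages in a pandas data frame, one column per party and one row per vote, and ask for the Pearson correlation.

import pandas as pd

# Sketch with toy values: each column is a party, each row is a vote,
# and each entry is the fraction of that party which voted yea.
shares = pd.DataFrame({
    "Conservative": [0.0, 1.0, 0.0, 1.0],
    "Liberal":      [1.0, 0.0, 1.0, 0.0],
    "NDP":          [1.0, 0.0, 1.0, 1.0],
})
print(shares.corr(method="pearson"))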

Though there are hundreds of votes, you can capture 85% of the variance by using only two dimensions with a principal component analysis. In effect, you create two fictional voting events (weighted combinations of the actual votes) that most accurately represent the stances of the various parties.
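
With scikit-learn, the reduction itself takes only a few lines. The following sketch (again with toy values, not the author’s script) places each party on a row and each vote in a column:

import numpy as np
from sklearn.decomposition import PCA

# Sketch: one row per party, one column per vote (fraction voting yea).
# Toy values; the actual analysis uses the 394 recorded votes.
X = np.array([
    [0.0, 1.0, 0.0, 1.0],  # e.g., Conservative
    [1.0, 0.0, 1.0, 0.0],  # e.g., Liberal
    [1.0, 0.0, 1.0, 1.0],  # e.g., NDP
])
pca = PCA(n_components=2)
coords = pca.fit_transform(X)                # 2D coordinates of each party
print(coords)
print(pca.explained_variance_ratio_.sum())   # share of the variance captured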

The result demonstrates that four of the Canadian political parties are clustered, meaning that they vote similarly, while one party (the Conservatives) is clearly distinct in its voting patterns.

My source code is available. It is made of two simple Python files that you can run yourself. I encourage you to run your own analysis. My work can be extended to include more data.

Book review: Theft of Fire by Devon Eriksen

When I was young, science fiction was the genre of choice for many engineers and scientists. But the genre has declined significantly in recent years. Part of the problem is the rise of dystopian fiction. In the imagined future, we are no longer conquering space or developing fantastic new technologies; rather, we are increasingly battling the consequences of climate change or of some evil corporation. In some respects, science fiction is always about the present, and the present has become unimaginative, fearful and generally anxious.

To illustrate, let me quote a recent piece in the Atlantic:

The United States is experiencing an extreme teenage mental-health crisis. From 2009 to 2021, the share of American high-school students who say they feel “persistent feelings of sadness or hopelessness” rose from 26 percent to 44 percent, according to a new CDC study. This is the highest level of teenage sadness ever recorded.

Civilizations are not eternal; they have a life cycle. When young people grow increasingly depressed, when we are more eager to tear down than to build up, we are facing a decline. On a per capita basis, most countries in the West are stagnating. Economists like Cowen and Gordon blame, in part, the relatively modest benefits of the Internet and of advances in computing. But I am sure Roman intellectuals could have blamed the decline of their empire on the fact that late-stage innovations like the onager were not sufficiently impactful.

The covid era is a perfect illustration. Entire societies turned inward, locked up, and shut down their industries and their entertainment. We did get an echo of science fiction, but of the dystopian kind. In communist China, there were the so-called big whites, a special police force in hazmat suits enforcing lockdowns. Worldwide, politicians took to wearing black masks. The Empire from Star Wars became a tangible reality, but without much in the way of rebels. No Luke, no Leia came to be celebrated.

But culture is not a mere reflection of an era. In some sense, culture defines the era; it is a beast larger than any one of us. As much as you would like to think that politicians are in charge, they are strictly limited by their culture. Václav Havel made this point in The Power of the Powerless (1978): the ruler of a communist country has no more power than a grocer to oppose the prevailing culture, and he may even have less.

So how do we get back to a future-oriented culture? One where the challenges are met with courage. One where we fight our way through instead of turning inward and regressing?

Culture is all of us. Culture is what we do. It is what we talk about. And it is what we read.

Few books have the power to inspire readers quite like Devon Eriksen’s Theft of Fire. This work of science fiction takes us on a journey through the frozen edge of the solar system, where a hidden treasure lies waiting to be discovered. It is a tale of adventure, intrigue, and the indomitable spirit of humanity.

It is an imperfect, and even somewhat sad, future. There are poor people; there are oppressed people. But it is a world where you can hope. It is very much a future in which Elon Musk’s futuristic and optimistic vision came true. Yes, Elon Musk himself shaped this universe. Eriksen does not tell us that if ‘the man’ (Elon Musk) has his way, we will live forever in harmony. Quite the opposite. And if you are intrigued by ChatGPT and Google Gemini, Eriksen has you covered as well.

Here is a quote to illustrate Eriksen’s style:

In space, everyone can hear you scream. Your vacsuit monitors your heartbeat, your blood pressure. It knows when you are injured. And it knows when to cry for help on the 100 MHz emergency band. And everyone is listening. Every suit radio. Every spacecraft. Every robot probe.

Theft of Fire offers a plausible and realistic universe. The characters, Marcus Warnoc and Miranda Foxgrove, are not mere archetypes; they are complex, flawed, and deeply human. Their struggle to trust one another and overcome their own demons is a powerful allegory for the modern human condition. There are class wars, between the very rich and the very poor, but it is not a Marxist story. Marcus might be poor and struggling, but he is never a victim. The rich people can be bad, but so can the poor people.

The world of Theft of Fire is one of contrasts: the cold, unforgiving vacuum of space and the warmth of human connection. It is a testament to the power of storytelling and the enduring appeal of the science fiction genre.

It may be telling to look at who published Eriksen’s book: it is self-published. As far as I can tell, one of Eriksen’s wives is in charge of marketing. She is the one who reached out to me and suggested this review. Maybe that’s how we change our destiny: from the bottom up. Just like in the novel.

More information: Eriksen’s home page.