Science and Technology links (April 17th 2021)

    1. Moderna built their COVID-19 vaccine without having the virus on site. They viewed it as a software problem.
    2. Humans and mice with red hair have elevated pain thresholds.
    3. Tumors (cancer) consume high levels of sugar. You would think that it means that cancer cells consume a lot of sugar, but it appears that it is not the case. Non-cancer cells within tumors are responsible for the high sugar intake.
    4. Exergames are physical activities that also stress our intelligence. It appears that exergames might have tangible cognitive benefits.
    5. Women are entering menopause at an ever later age. They gained a year and a half since 1960.
    6. About 2.5 billion Tyrannosaurus rex lived during the 2.4 million years that the species existed.
    7. Medical doctors tend to be pessimistic and to overestimate your probability of disease, which may lead to unnecessary treatments.

How fast can you sort arrays of integers in Java?

Programming languages come with sorting functions by default. We can often do much better. For example, Downs has shown that radix sort can greatly surpass default sort functions in C++. Radix sort is your friend if you want to sort large arrays of integers.

What about Java? Richard Startin and Gareth Andrew Lloyd have been working hard to improve the sorting function used inside the RoaringBitmap library. Though we use a custom radix sort function, it is not difficult to make it more generic, so that it can sort any array of integers. I came up with the following code:

public static void radixSort(int[] data) {
  int[] copy = new int[data.length];
  int[] level0 = new int[257];
  int[] level1 = new int[257];
  int[] level2 = new int[257];
  int[] level3 = new int[257];
  for (int value : data) {
    value -= Integer.MIN_VALUE;
    level0[(value & 0xFF) + 1]++;
    level1[((value >>> 8) & 0xFF) + 1]++;
    level2[((value >>> 16) & 0xFF) + 1]++;
    level3[((value >>> 24) & 0xFF) + 1]++;
  }
  for (int i = 1; i < level0.length; ++i) {
    level0[i] += level0[i - 1];
    level1[i] += level1[i - 1];
    level2[i] += level2[i - 1];
    level3[i] += level3[i - 1];
  }
  for (int value : data) {
    copy[level0[(value - Integer.MIN_VALUE) & 0xFF]++] = value;
  }
  for (int value : copy) {
    data[level1[((value - Integer.MIN_VALUE) >>> 8) & 0xFF]++] = value;
  }
  for (int value : data) {
    copy[level2[((value - Integer.MIN_VALUE) >>> 16) & 0xFF]++] = value;
  }
  for (int value : copy) {
    data[level3[((value - Integer.MIN_VALUE) >>> 24) & 0xFF]++] = value;
  }
}

It is about as unsophisticated as it looks. We compute four histograms, one per byte in an integer: Java stores integers using 4-byte words. Then we do 4 passes through the data. We could make it more sophisticated by examining the histogram: if the higher-level histograms are trivial, we can skip some passes. We could extend it to Java longs though we would then need 4 extra passes. It is also possible to generalize to floating-point numbers.

The strange subtractions with MIN_VALUE accommodate the fact that Java has signed integers (positive and negative) in two's complement format.
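The trick can be checked on its own: subtracting Integer.MIN_VALUE flips the sign bit, so that the unsigned order of the biased values matches the signed order of the originals. A small self-contained sketch (the class and method names here are mine, not from the library):

```java
// Sketch: the MIN_VALUE bias maps signed order to unsigned order,
// which is what a byte-by-byte radix sort relies upon.
public class BiasDemo {
    static int bias(int value) {
        return value - Integer.MIN_VALUE; // flips the top (sign) bit
    }

    public static void main(String[] args) {
        int[] values = {Integer.MIN_VALUE, -5, -1, 0, 1, 5, Integer.MAX_VALUE};
        for (int i = 0; i + 1 < values.length; i++) {
            // unsigned comparison of biased values preserves the signed order
            assert Integer.compareUnsigned(bias(values[i]), bias(values[i + 1])) < 0;
        }
        System.out.println("biased order matches signed order");
    }
}
```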

Let us compare it against the default Arrays.sort function in Java. We want to sort 1 million integers, generated uniformly at random. Using Java 8 on an Apple M1 processor, we find that radixSort is about ten times faster than Arrays.sort.

Arrays.sort 60 ms
RadixSort 5 ms

There are some caveats. The radix sort function is likely to use more memory. Furthermore, the results are sensitive to the input data (both its size and its distribution). Nevertheless, for some systems, radix sort can be a net win.

My code is available.

My programming setup

As my GitHub profile indicates, I program almost every single working day of the year. I program in many different languages such as C++, C, Go, Java, JavaScript, Python, R, Swift, Rust, C#; even though I do not master all of these languages. Some of the projects I work on can be considered “advanced programming”. Some of my code is used in production.

I was recently asked how I build code.

I do not use a debugger in my main work. I will use debuggers exceptionally for hacking code that I do not understand. Before you object, keep in mind that I am in good company: Linus Torvalds, Brian W. Kernighan, Rob Pike and Guido van Rossum have stated that they do not rely primarily (or at all) on debuggers. Stepping through code is a slow, tiring process.

My primary environment these days is Visual Studio Code. It is great for the following reasons:

  1. It is portable. I can switch from macOS to Windows and back, and it keeps working.
  2. It is fast. I know that’s a ridiculous statement to make considering that it is written in JavaScript, but it is just very smooth.
  3. It is simple. It is only a text editor after all.
  4. It allows me to program in all of the languages I care about.
  5. It is a state-of-the-art editor. I am never frustrated by a missing feature. I very rarely encounter bugs.
  6. I use fancy regular expressions all of the time and Visual Studio Code supports every expression I can throw at it.
  7. It has a great terminal application embedded. It is better than most of the terminal applications provided by operating systems (especially under Windows).
  8. It integrates very nicely with ssh. A lot of my work involves hacking code that resides on a server (often Linux). With Visual Studio Code, I can basically edit and run code on any system where ssh lives. It is so good that I sometimes forget that I am connected to a remote server.

That is pretty much all you would see on my screen when I am programming: Visual Studio Code. I do a lot of my work in the terminal. I use git all of the time for version control. The Go and Rust tools work nicely in a shell. For C++ and C, I use CMake or simple Makefiles, again mostly with command lines. You can invoke CMake command lines under Windows too, and have the Visual Studio compiler build your code. I find that C# has really nice support for command-line builds and tests. Obviously, Java works well with gradle or maven. JavaScript has node and npm. I use Swift, though not for building applications, so I can rely on the Swift Package Manager. My go-to scripting language is Python. From the command-line results, if there are errors, Visual Studio Code often allows me to click and jump directly to the offending line.

I often program inside Docker containers and that works really well with a terminal and an editor. Docker has been a game changer for me: I am no longer limited by whatever tools I can install on my host system.

I stay away from IDEs. Keep in mind that IDEs are not at all recent. Like many people, I started programming using IDEs. I learned to program seriously with Turbo Pascal, and then I used Visual Basic, Delphi, Visual Studio, Eclipse and so forth. I am not against IDEs per se and I will happily spin up Visual Studio, Xcode or IntelliJ, but I do not want my workflow to depend on a specific tool and I like to script everything away. I realize that many people want to be able to press a button that builds and tests their code, but I prefer to type ctest or go test or cargo test or npm test. Importantly, I want my code to work for other people no matter what their setup is: I find it offensive to require that collaborators use the same graphical tools that I do. Furthermore, my experience has been that though learning the command-line tools is initially harder, it tends to pay off in the long term via better maintainability, more automation, and a deeper knowledge of the tools.

Importantly, I am not “locked in” with Visual Studio Code. I can switch to any other text editor in an instant. And I sometimes do. If I need to change a file remotely, I often use vim.

I never use code completion. If the editor provides it, I disable it. However, I often spend a lot of time reading a project’s documentation. If the documentation is missing, I might read the source code instead.

How do I deal with bugs? As I just stated, I will almost never step through the code execution. Instead I will use one of the following strategies:

  1. In an unsafe language like C++, you can get really nasty problems if you have illegal memory accesses. If I am getting strange bugs, I might run my code with sanitizers. It is a game changer: it makes C and C++ usable languages. There are various limitations to sanitizers and they do tend to make the compile-run cycle longer, so I do not use them routinely. In fact, I build most of my code in release mode; I rarely compile in debug mode.
  2. I write a lot of tests. In fact, most of the code that I write consists of tests. Once you have extensive tests, you usually can narrow down the bugs to one piece of code. I will then just read it over carefully. If I just sit quietly and study a small enough piece of code, I can often eventually understand and fix the bug. I am probably not a better programmer than you are, but if you do not write tests, then I can be certain that your code has more bugs.
  3. I make obsessive use of continuous integration tests. Running all tests, on all platforms, with every single code change is great. In this manner, I avoid accumulating bugs.
  4. I write documentation. Often, merely explaining (in the code or elsewhere) what a function should do is enough to figure out the bugs.
  5. I might use logging in hard cases. That is, I print out how the data is being processed. In sufficiently complex code, it pays to insert logging instructions describing the execution of the code; these logging instructions can be enabled or disabled at compile time. These logs are rarely about the state of specific variables or locations in the code. Rather, they present a semantically consistent log of how the data is getting processed. In my experience, it is only needed in specific cases to study what I might call “meta-bugs”: all of your functions are bug-free, but they interact poorly due to bad assumptions.
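The compile-time switch in item 5 can be approximated in Java with a constant flag: when the flag is a static final false, the logging branches are dead code and get eliminated. A minimal sketch (the class, method and flag names below are made up for illustration):

```java
// Hypothetical sketch: logging that is enabled or disabled at build time.
// With VERBOSE fixed to false, the 'if (VERBOSE)' branches cost nothing.
public class DataPipeline {
    static final boolean VERBOSE = false; // flip at build time to trace the data flow

    static int[] process(int[] input) {
        if (VERBOSE) System.out.println("processing " + input.length + " values");
        int[] output = new int[input.length];
        for (int i = 0; i < input.length; i++) {
            output[i] = input[i] * 2;
            if (VERBOSE) System.out.println("value " + input[i] + " -> " + output[i]);
        }
        if (VERBOSE) System.out.println("done");
        return output;
    }
}
```

The point is that the log lines describe how the data moves through the pipeline, not the state of individual variables.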

My work is often focused on performance. I sometimes examine the assembly output from the compiler, but rarely so. You can get most of the performance with simple programming techniques. However, you need to run benchmarks. I might not know your systems better than you do, but if you never write benchmarks, then my code is probably faster.

I also do not tend to rely very much on static analysis. I do not pay much attention to compiler warnings. I also shy away from “grand theories” of programming. I will happily use functional programming or object-oriented-programming, but I will never get stuck in one approach.

I would describe my approach to programming as being mostly empirical. I write code, then I test it then I benchmark it. This works in all programming languages, from C to JavaScript.

Remark: This is how I work, not a prescription for how other people should work. It is fine that you prefer to work differently; I encourage diversity. It is not fine to act as if I am lecturing you to work a certain way.

Science and Technology links (March 27th 2021)

    1. Scientists, including climate-science researchers, often travel to faraway places for conferences. Attending a live conference is time consuming and expensive. The cost is relative: attending a $3000 conference in Hawaii is cheap for the Harvard student, but a considerably higher expense for people elsewhere in the world. The conference is also out of reach for people who need to care for young children. People often claim that the environmental cost of the travel can be offset but there is no hard evidence that it actually works. The pandemic has forced scientists to suspend their travel. Should they resume? Niner and Wassermann write:

      Avoiding international travel and associated bureaucracy, time and expense could overcome many of the historic injustices preventing many from participating in and benefiting from international conferences, and also avoid the emissions associated with international air travel. However, prior to 2020, there has been resistance to moving these events online because of the perception that the value of conferences cannot be cultivated online. (…) we conclude that holding international conferences online (..) is a significant improvement in the capacity of conferences to meet the moral imperatives of the conservation community by addressing the climate crisis and some of the systemic injustices within the field.

    2. We found a largely preserved basket that is 10,000 years old. This was shortly after the last ice age.
    3. Wearable devices can help people lose weight.
    4. Disk drives should exceed 100 TB by 2030. 100 TB is enough to store everything your eyes can see during a decade.
    5. There is such a thing as too much exercise.
    6. Students in many countries do not gain critical thinking skills in college.
    7. Your nervous system is protected by myelin. As you age, your myelin regenerates more and more poorly. It appears that this might be reversible.
    8. There are more human twins than ever.
    9. A 70-year-old albatross has given birth. It is generally difficult to tell how old a bird is, but this particular albatross was tagged many years ago. Like many other animal species, we believe that many sea birds experience negligible senescence, which means that they do not become less fertile or more frail with time.
    10. Though you keep the same genes all your life, your genes are marked by methylation and that can change over time. In other words, your DNA is not “stateless”, it can change. Some genes can thus become silent while others can become active. Using machine learning, we can use your methylation state to determine your biological age. Researchers have reported that methylation can further predict your clinical outcome when affected by SARS-CoV-2 (COVID-19). In time, this could be used to quickly identify people most at risk for some diseases. Meanwhile, losing weight appears to lower your biological age as defined by methylation. You might become biologically younger by about a year if you lose a lot of weight.
    11. When doing resistance training (lifting weights), I have often been told to lift to failure (until you no longer can lift). It seems that it might be poor advice. Instead, you may want to tailor your training to maximize volume (total effort).
    12. A low-carbohydrate diet might help you keep your muscle mass as you grow older. It does in mice. Further, vegans may have poorer bone health as they age.

Counting cycles and instructions on the Apple M1 processor

When benchmarking software, we often start by measuring the time elapsed. If you are benchmarking data bandwidth or latency, it is the right measure. However, if you are benchmarking computational tasks where you avoid  disk and network accesses and where you only access a few pages of memory, then the time elapsed is often not ideal because it can vary too much from run to run and it provides too little information.

Most processors will adjust their frequency in response to power and thermal constraints. Thus you should generally avoid using a laptop. Yet even if you can get stable measures, it is hard to reason about your code from a time measurement. Processors operate in cycles, retiring instructions. They have branches, and sometimes they mispredict these branches. These are the measures you want!

You can, of course, translate the time into CPU cycles if you know the CPU frequency. But it might be harder than it sounds because even without physical constraints, processors can vary their frequency during a test. You can measure the CPU frequency using predictable loops. It is a little bit awkward.
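As a rough illustration of the “predictable loop” idea, one can time a long chain of dependent additions, which most cores retire at about one per cycle, and divide the count by the elapsed time. This is only a crude sketch of mine, not code from the project: JIT warm-up, frequency scaling and loop overhead all add noise.

```java
// Crude sketch: estimate the CPU frequency by timing a dependent chain
// of additions (roughly one addition retired per cycle on most cores).
public class FrequencyEstimate {
    static long estimateHz(long iterations) {
        long start = System.nanoTime();
        long x = 0;
        for (long i = 0; i < iterations; i++) {
            x += i | 1; // each addition depends on the previous one
        }
        long elapsed = System.nanoTime() - start;
        if (x == 42) System.out.println(x); // keep the loop from being optimized away
        return iterations * 1_000_000_000L / Math.max(elapsed, 1);
    }

    public static void main(String[] args) {
        estimateHz(10_000_000); // warm up the JIT
        System.out.println("estimated frequency: " + estimateHz(100_000_000) + " Hz");
    }
}
```

Even when such an estimate is stable, it tells you nothing about instructions, branches or mispredictions, which is why performance counters are preferable.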

Most people then go to a graphical tool like Intel VTune or Apple Instruments. These are powerful tools that can provide fancy graphical displays, run samples, record precise instruction counts and so forth. They also tend to work across a wide range of programming languages.

These graphical tools use the fact that processor vendors include “performance counters” in their silicon. You can tell precisely how many instructions were executed between two points in time.

Sadly, these tools can be difficult to tailor to your needs and to automate. Thankfully, the Linux kernel exposes performance counters on most processors. Thus if you write code for Linux, you can rather easily query the performance counters yourself. You can put markers in your code and find out how many instructions or cycles were spent between these markers. We often refer to such code as being “instrumented”. It requires you to modify your code and it will not work in all programming languages, but it is precise and flexible. It even works under Docker if you are into containers. You may need privileged access to use the counters. Presumably you can also access the performance counters from your own program under Windows, but I have never found any documentation or examples.

My main laptop these days is a new Apple MacBook with an M1 processor. This ARM processor is remarkable. In many ways, it is more advanced than comparable Intel processors. Sadly, until recently, I did not know how to instrument my code for the Apple M1.

Recently, one of the readers of my blog (Duc Tri Nguyen) showed me how, inspired by code from Dougall Johnson. Dougall has been doing interesting research on Apple’s processors. As far as I can tell, it is entirely undocumented and could blow up your computer. Thankfully, to access the performance counters, you need administrative access (wheel group). In practice, it means that you could start your instrumented program in a shell using sudo so that your program has, itself, administrative privileges.

To illustrate the approach, I have posted a full C++ project which builds an instrumented benchmark. You need administrative access and an Apple M1 system. I assume you have installed the complete developer kit with command-line utilities provided by Apple.

I recommend measuring both minimal counters as well as the average counters. When the average is close to the minimum, you usually have reliable results. The maximum is less relevant in computational benchmarks. Observe that measures taken during a benchmark are not normally distributed: they are better described as following a log-normal distribution.

The core of the benchmark looks like the following C++ code:

  performance_counters agg_min{1e300};
  performance_counters agg_avg{0.0};
  for (size_t i = 0; i < repeat; i++) {
    performance_counters start = get_counters();
    // ... run the code being benchmarked ...
    performance_counters end = get_counters();
    performance_counters diff = end - start;
    agg_min = agg_min.min(diff);
    agg_avg += diff;
  }
  agg_avg /= repeat;

Afterward, it is simply a matter of printing the results. I decided to benchmark floating-point number parsers in C++. I get the following output:

# parsing random numbers
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB 
model: generate random numbers uniformly in the interval [0.000000,1.000000]
volume: 10000 floats
volume = 0.0762939 MB
    strtod    375.92 instructions/float (+/- 0.0 %)
                75.62 cycles/float (+/- 0.1 %)
                4.97 instructions/cycle
                88.93 branches/float (+/- 0.0 %)
                0.6083 mis. branches/float

fastfloat    162.01 instructions/float (+/- 0.0 %)
                22.01 cycles/float (+/- 0.0 %)
                7.36 instructions/cycle
                38.00 branches/float (+/- 0.0 %)
                0.0001 mis. branches/float

As you can see, I get the average  number of instructions, branches and mispredicted branches for every floating-point number. I also get the number of instructions retired per cycle. It appears that on this benchmark, the Apple M1 processor gets close to 8 instructions retired per cycle when parsing numbers with the fast_float library. That is a score far higher than anything possible on an Intel processor.

You should note how precise the results are: the minimum and the average number of cycles are almost identical. It is quite uncommon in my experience to get such consistent numbers on a laptop. But these Apple M1 systems seem to show remarkably little variation. It suggests that there is little in the way of thermal constraints. I usually avoid benchmarking on laptops, but I make an exception with these laptops.

You should be mindful that the ‘number of instructions’ is an ill-defined measure. Processors can fuse or split instructions, and they can decide to count or not the number of speculative instructions. In my particular benchmark, there are few mispredicted branches so that the difference between speculative instructions and actually retired instructions is not important.

To my knowledge, none of this performance-counter access is documented by Apple. Thus my code should be viewed with suspicion. It is possible that these numbers are not what I take them to be. However, the numbers are generally credible.

My source code is available.

Note: Though my code only works properly under the Apple M1 processor, I believe it could be fixed to support Intel processors.

Apple’s M1 processor and the full 128-bit integer product

If I multiply two 64-bit integers (having values in [0, 2^64)), the product requires 128 bits. Intel and AMD processors (x64) can compute the full (128-bit) product of two 64-bit integers using a single instruction (mul). ARM processors, such as those found in your mobile phone, require two instructions to achieve the same result: mul computes the least significant 64 bits while mulh computes the most significant 64 bits.

I believe that it has often meant that computing the full 128-bit product was more expensive, everything else being equal, on ARM processors than on x64 (Intel) processors. However, the instruction set does not have to determine the performance. For example, ARM processors can recognize that I am calling both instructions (mul  and mulh) and process them more efficiently. Or they may simply have very inexpensive multipliers.

To explore the problem, let us pick two pseudo-random number generators, splitmix and wyhash:

uint64_t state; // shared global state (the seed)

uint64_t splitmix() {
    uint64_t z = (state += UINT64_C(0x9E3779B97F4A7C15));
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}

uint64_t wyhash() {
    state += 0x60bee2bee120fc15ull;
    __uint128_t tmp = (__uint128_t)(state)*0xa3b195354a39b70dull;
    uint64_t m1 = (tmp >> 64) ^ tmp;
    tmp = (__uint128_t)m1 * 0x1b03738712fad5c9ull;
    return (tmp >> 64) ^ tmp;
}

As I reported earlier, wyhash should almost always be faster on an Intel or AMD processor as it is only two multiplications with an addition whereas the splitmix function is made of two multiplications with several other operations. However, wyhash requires two full multiplications whereas splitmix requires only two 64-bit products. If the two full multiplications in wyhash are equivalent to four multiplications, then wyhash becomes more expensive.
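For reference, the same two generators can be written in Java, where Math.multiplyHigh (available since Java 9) exposes the most significant 64 bits of the signed 128-bit product; an unsigned high product can be recovered from it with a small correction. This is a sketch of mine, not the benchmark code:

```java
// Sketch of the two generators in Java. Math.multiplyHigh (Java 9+)
// yields the high 64 bits of the signed product; we adjust it to get
// the unsigned high product that wyhash needs.
public class Generators {
    static long state = 1234567;

    static long splitmix() {
        long z = (state += 0x9E3779B97F4A7C15L);
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    static long unsignedMultiplyHigh(long a, long b) {
        // standard correction from signed to unsigned high product
        return Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a);
    }

    static long wyhash() {
        state += 0x60bee2bee120fc15L;
        long lo = state * 0xa3b195354a39b70dL;
        long hi = unsignedMultiplyHigh(state, 0xa3b195354a39b70dL);
        long m1 = hi ^ lo;
        lo = m1 * 0x1b03738712fad5c9L;
        hi = unsignedMultiplyHigh(m1, 0x1b03738712fad5c9L);
        return hi ^ lo;
    }
}
```

Recent Java versions (18 and up) provide Math.unsignedMultiplyHigh directly, making the correction unnecessary.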

I wrote a small C++ benchmark to measure the time (in nanoseconds) that it takes to compute a random value using Apple’s new M1 processor (ARM, 3.2 GHz). The compiler is Apple clang version 12 which comes by default on the new Apple Silicon laptops.

Apple M1
wyhash 0.60 ns/value
splitmix 0.85 ns/value

My test suggests that it takes a bit under 3 cycles to generate a number with splitmix and a bit under 2 cycles to generate a number with wyhash. The wyhash generator is much faster than splitmix on the Apple M1 processor (by about 40% to 50%) which suggests that Apple Silicon is efficient at computing the full 128-bit product of two 64-bit integers. It would be nicer to be able to report the number of instructions and cycles, but I do not know how to instrument code under macOS for such measures.

Further reading: Apple M1 Microarchitecture Research by Dougall Johnson

Credit: Maynard Handley suggested this blog post.

Update: The numbers were updated since they were off by a factor of two due to a typographical error in the code.

Update 2: An interested reader provided me with the means to instrument the code. The precise number of cycles per value is a bit over 2.8 for splitmix and exactly 2 for wyhash. Please see my repository for the corresponding code.

Appendix. Some readers are demanding assembly outputs. The splitmix function compiles to 9 assembly instructions

LBB7_2:                                 ; =>This Inner Loop Header: Depth=1
	eor	x12, x9, x9, lsr #30
	mul	x12, x12, x10
	eor	x12, x12, x12, lsr #27
	mul	x12, x12, x11
	eor	x12, x12, x12, lsr #31
	str	x12, [x0], #8
	add	x9, x9, x8
	subs	x1, x1, #1              ; =1
	b.ne	LBB7_2

while the wyhash function compiles to 10 assembly instructions

LBB8_2:                                 ; =>This Inner Loop Header: Depth=1
	mul	x12, x9, x10
	umulh	x13, x9, x10
	eor	x12, x13, x12
	mul	x13, x12, x11
	umulh	x12, x12, x11
	eor	x12, x12, x13
	str	x12, [x0], #8
	add	x9, x9, x8
	subs	x1, x1, #1              ; =1
	b.ne	LBB8_2

Science and Technology links (March 6th 2021)

  1. Increasing schooling does not improve social outcomes at a population level.
  2. Venetian glass was made near Venice as early as 450 BC. It spread worldwide through trade. Venetian glass made its way as far as North America. We have now determined that it was present in Alaska before Christopher Columbus. We expect that it came there through Asia.
  3. Chinese researchers have carried out genetic manipulations on monkeys, giving them a human gene. It could be that the monkeys become smarter as a result.
  4. Ocean heat may have a limited effect on marine glaciers.
  5. Genetics determines most social outcomes, based on historical outcomes.
  6. Endurance exercise may improve cognitive function.
  7. Fewer people are having strokes.
  8. Video game play is positively correlated with well-being.
  9. Parkinson’s may be linked to having too much iron. It is relatively easy to lower someone’s iron store.

How does your programming language handle “minus zero” (-0.0)?

The ubiquitous IEEE floating-point standard defines two numbers to represent zero, the positive and the negative zeros. You also have the positive and negative infinity. If you compute the inverse of the positive zero, you get the positive infinity. If you compute the inverse of the negative zero, you get the negative infinity.

It seems that programming languages handle these values in vastly different ways.

Though the C++ standard probably says very little about negative zeros, in practice I am getting the results I would expect. In C++ (GNU GCC and LLVM clang), the following code prints out (-inf, inf, -inf) :

double minus_zero = -0.0;
double plus_zero = +0.0;
double parsed = strtod("-0.0", nullptr);
std::cout <<  1.0/minus_zero << std::endl;
std::cout <<  1.0/plus_zero << std::endl;
std::cout <<  1.0/parsed << std::endl;

Java also produces what I expect (-Infinity, Infinity, -Infinity) from the following code sample:

double minus_zero = -0.0;
double plus_zero = +0.0;
double parsed = Double.valueOf("-0.0");
System.out.println(1.0/minus_zero);
System.out.println(1.0/plus_zero);
System.out.println(1.0/parsed);

Swift also seems to do what I expect (-inf, inf, -inf) :

var minus_zero = -0.0
var plus_zero = +0.0
var parsed = Double("-0.0")!
print(1.0/minus_zero)
print(1.0/plus_zero)
print(1.0/parsed)

The Go programming language seems to be unaware of negative zeros when parsing literal constants. Thus the following code will print out +Inf, +Inf.

var minus_zero = -0.0
var plus_zero = 0.0
fmt.Println(1/minus_zero, 1/plus_zero)

C# is aware of negative infinity, but when parsing strings the behaviour depends on your version. With .NET 5, the following code prints out what I expect (-infinity, infinity, -infinity) but with previous versions you may get -infinity, infinity, infinity.

double minus_zero = -0.0;
double plus_zero = +0.0;
double parsed = Double.Parse("-0.0");
Console.WriteLine(1.0/minus_zero);
Console.WriteLine(1.0/plus_zero);
Console.WriteLine(1.0/parsed);

Parsing floating-point numbers really fast in C#

Programmers often write out numbers as strings (e.g., 3.1416) and they want to read back the numbers from the string. If you read and write JSON or CSV files, you do this work all of the time.

Previously, we showed that we could parse floating-point numbers at a gigabyte per second or better in C++ and in Rust, several times faster than the conventional approach. In Go 1.16, our approach improved parsing performance by up to a factor of two.

Not everyone programs in C++, Rust or Go. So what about porting the approach to C#? csFastFloat is the result!

For testing, we rely on two standard datasets, canada and mesh. The mesh dataset is made of “easy cases” whereas the canada dataset is more difficult. We use .NET 5 and an AMD Rome processor for testing.

parser                    canada                mesh
Double.Parse (standard)   3 million floats/s    11 million floats/s
csFastFloat (new)         20 million floats/s   35 million floats/s

Importantly, the new approach should give the same exact results. That is, we are accurate.

Can this help in the real world? I believe that the most popular CSV (comma-separated values) parsing library in C# is probably CSVHelper. We patched CSVHelper so that it would use csFastFloat instead of the standard library. Out of a set of five float-intensive benchmarks, we found gains ranging from 8% to 2x. Your mileage will vary depending on your data and your application, but you should see some benefits.

Why would you see only an 8% gain some of the time? Because, in that particular case, only about 15% of the total running time has to do with number parsing. The more the rest of the processing is optimized, the more benefit you should get out of fast float parsing.
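This is just Amdahl's law: if a fraction p of the running time is parsing and parsing becomes s times faster, the overall speedup is 1 / ((1 - p) + p / s). A quick sanity check using the 15% figure above:

```java
// Amdahl-style estimate: overall speedup when only a fraction of the
// running time benefits from an optimization.
public class Amdahl {
    static double speedup(double parsingFraction, double parsingSpeedup) {
        return 1.0 / ((1.0 - parsingFraction) + parsingFraction / parsingSpeedup);
    }

    public static void main(String[] args) {
        // even infinitely fast parsing caps the overall gain near 18%
        // when parsing is only 15% of the total time
        System.out.println(speedup(0.15, 1e9)); // about 1.18
        // a 2x faster parser yields roughly an 8% overall gain
        System.out.println(speedup(0.15, 2.0)); // about 1.08
    }
}
```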

The package is available on nuget.

Credit: The primary author is Carl Verret. We would like to thank Egor Bogatov from Microsoft who helped us improve the speed, changing only a few lines of code, by making use of his deep knowledge of C#.

Science and Technology links (February 13th 2021)

  1. Researchers make inexpensive transparent wood.
  2. Our cells produce energy using their mitochondria. Researchers show that you can efficiently isolate mitochondria from mammalian cells.
  3. Vitamin D supplementation could save tens of thousands of lives annually in Germany alone by preventing cancer.
  4. It appears that Alpha Centauri may have habitable planets. Last week I reported that we might be getting an artificial signal from Alpha Centauri.
  5. Researchers report that supplementing middle-aged mice with nicotinamide riboside made them stronger. You can purchase such a supplement for yourself.  (I do not make medical recommendations.)
  6. Age discrimination when hiring costs the US economy hundreds of billions of dollars. Age discrimination is reportedly widespread in the technology industry.

On the cost of converting ASCII to UTF-16

Many programming languages like Java, JavaScript and C# represent strings using UTF-16 by default. In UTF-16, each ‘character’ uses 16 bits. To represent all of the more than one million Unicode characters, some special ‘characters’ can be combined in pairs (surrogate pairs), but for much of the common text, one character is truly 16 bits.

Yet much of the text content processed in software is simple ASCII. Strings of numbers for example are typically just ASCII. ASCII characters can be represented using only 7 bits.

It implies that software frequently has to convert ASCII to UTF-16. In practice, it amounts to little more than interleaving our ASCII bytes with zero bytes. We can model such a function with a simple C loop.

void toutf16(const uint8_t *array, size_t N,
             uint16_t *out) {
  for (size_t i = 0; i < N; i++) {
    out[i] = array[i];
  }
}

How expensive do we expect this code to be?

Compared to a simple copy from N bytes to N bytes, we are writing an extra N bytes. With code that reads and writes a lot of data, it is often sensible to use the number of written bytes as a model.

In terms of instructions, an x64 processor can use SIMD instructions to accelerate the processing. However, you would hope that most processors can do this processing at high speed.
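For reference, the same widening can be written in Java, where each ASCII byte becomes one 16-bit char; this sketch is mine, not benchmark code:

```java
// The same widening in Java: each ASCII byte becomes one 16-bit char,
// i.e., the byte is zero-extended into the low half of the char.
public class ToUtf16 {
    static char[] toUtf16(byte[] ascii) {
        char[] out = new char[ascii.length];
        for (int i = 0; i < ascii.length; i++) {
            out[i] = (char) (ascii[i] & 0xFF); // zero-extend the byte
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] input = "number 3.1416".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        System.out.println(new String(toUtf16(input)));
    }
}
```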

I wrote a benchmark in C and ran it on different systems. I use a small input ASCII string (10kB). I measure the throughput based on the input size.

                                   to utf16   memcpy
AMD Zen 2 (x64), GNU GCC 8, -O3    24 GB/s    46 GB/s
Apple M1, clang 12                 35 GB/s    68 GB/s

Of course results will vary and I expect that it is entirely possible to greatly accelerate my C function. However, it seems reasonable to estimate that the computational cost alone might be twice that of a memory copy. In practice, it is likely that memory allocation and structure initialization might add a substantial overhead when copying ASCII content into a UTF-16 string.

Science and Technology links (February 6th 2021)

  1. You can use artificial intelligence and satellite images to count the number of elephants found in the wild.
  2. It appears that a billion people on Earth now use an iPhone. The number would be higher if not for the pandemic.
  3. A supplement used by body builders (alpha-ketoglutarate) made old mice healthier, and they lived longer as a result. The treated mice looked biologically younger.
  4. According to an article published in Nature, the Antarctic continent has not warmed in the last seven decades. The sea ice area has also grown.
  5. It appears that children with autism become less active physically over time compared to neurotypical children.
  6. We are getting a signal from the closest star to our own (Proxima Centauri) and the signal might be a sign of intelligent life.
  7. New research suggests that adding cheese and red wine to the diet daily, and lamb on a weekly basis, may also improve long-term cognitive outcomes.

Number Parsing at a Gigabyte per Second

Computers typically rely on binary floating-point numbers. Most often they span 64 bits or 32 bits. Many programming languages call them double and float. JavaScript represents all its numbers, by default, with a 64-bit binary floating-point number type.

Human beings most often represent numbers in decimal notation, such as 0.1 or 1e-1. Thus many systems store numbers in decimal notation using ASCII text. The software must go from binary floating-point numbers to ASCII and back. There has been much work done on the serialization (from binary floating-point numbers to ASCII) but comparatively less work on the deserialization (from ASCII to binary floating-point numbers).

Typically, reading decimal numbers and converting them to binary floating-point numbers is slow. How slow? Often on the order of 200 MB/s. That is much slower than your disk, if you have a fast disk: a PlayStation 5 has a disk capable of over 5 GB/s in bandwidth.
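You can estimate the conventional speed for yourself with a small routine built on the C standard strtod function. A rough sketch, assuming a whitespace-separated buffer of numbers (a serious benchmark would use large inputs and repeat the measurement):

```c
#include <stdlib.h> // strtod
#include <time.h>

// Parse every number in a whitespace-separated, NUL-terminated
// buffer with strtod and return the throughput in MB/s.
double strtod_throughput(const char *buffer, size_t length,
                         double *sum_out) {
  clock_t start = clock();
  double sum = 0;
  const char *p = buffer;
  const char *end = buffer + length;
  while (p < end) {
    char *q;
    double value = strtod(p, &q);
    if (q == p) break; // no number found: stop
    sum += value;
    p = q;
  }
  double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
  *sum_out = sum; // keep the result so parsing is not optimized away
  return (double)length / (1024.0 * 1024.0) / seconds;
}
```

On a typical system, with a large enough input, such a strtod loop lands in the range of hundreds of megabytes per second.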

You can do much better. I finally published a manuscript that explains a better approach: Number Parsing at a Gigabyte per Second. Do not miss the acknowledgements section of the paper: this was joint work with really smart people.

The benchmarks in the paper are mostly based on the C++ library fast_float. The library requires a C++11 standard compliant compiler. It provides functions that closely emulate the standard C++ from_chars functions for float and double types. It is used by Apache Arrow and Yandex ClickHouse. It is also part of the fastest YAML library in the world. These from_chars functions are part of the C++17 standard. To my knowledge, only Microsoft has implemented them at this point: they are not available in GNU GCC.

On my Apple M1 MacBook, using a realistic data file (canada), fast_float far exceeds a gigabyte per second, getting close to 1.5 GB/s. The conventional C function (strtod) provided by the default Apple standard library does quite poorly on this benchmark.

What about other programming languages?

A simplified version of the approach is now part of the Go standard library, thanks to Nigel Tao and other great engineers. It accelerated Go processing while helping to provide exact parsing. Nigel Tao has a nice post entitled The Eisel-Lemire ParseNumberF64 Algorithm.

What about Rust? There is a Rust port. Unsurprisingly, the Rust version is a match for the C++ version, speed-wise. Here are the results using the same file and the same processor (Apple M1):

from_str (standard)      130 MB/s
lexical (popular lib.)   370 MB/s
fast-float              1200 MB/s

There is an R binding as well, with the same great results:

On our machine, fast_float comes out as just over 3 times as fast as the next best alternative (and this counts the function calls and all, so pure parsing speed is still a little better).

A C# port is in progress and preliminary results suggest we can beat the standard library by a healthy margin. I am hoping to get a Swift and Java port going this year (help and initiative are invited).

Video. Last year, I gave a talk at Go Systems Conf SF 2020 entitled Floating-point Number Parsing w/Perfect Accuracy at GB/sec. It is on YouTube.

Further reading. See my earlier posts… Fast float parsing in practice (March 2020 blog post) and Multiplying backward for profit (April 2020 blog post).

Science and Technology links (January 24th 2021)

  1. Year 2020 was great for PC makers. We are selling more and more PCs. Reportedly, Sony sold 3.4 million PlayStation 5 in only four weeks, a record. The demand for the Facebook Quest 2 VR headset is reportedly several times the demand for the original Quest. Valve, the company that makes the Index VR headset, is unable to make enough to meet demand. The demand for computer chips is so high that car manufacturers ran out of chips and had to close car factories. The demand for workers in information technology has remained strong throughout 2020.
  2. There might be a way to predict autism in children by studying the sperm of the father.
  3. As we age, we accumulate senescent cells. In chronically large quantities, these cells are harmful. Thankfully, there are now effective therapies to remove these cells, at least in mice. Doing so makes the old mice smarter. This suggests that the senescent cells made them dumber in the first place.
  4. Supplementing old mice with a hormone that our bodies generate during exercise makes them run twice as hard.
  5. Researchers rejuvenated old human cells by 30 years. The process is tricky and must be done in a laboratory, but it is conceivable that human tissue could be rejuvenated in a laboratory, prior to transplantation.
  6. Cold water immersion following an endurance workout probably reduces your muscle growth. So go easy with the cold showers and baths!

Science and Technology links (January 16th 2021)

  1. You can tell people’s political affiliation by image recognition technology.
  2. There are far fewer stars and galaxies than we thought. The universe is relatively small. (The source article has been revised with different conclusions.)
  3. Dog ownership conferred a 31% risk reduction for cardiovascular death.
  4. People with high total cholesterol or LDL-C live just as long or longer than people with low cholesterol. Statin trials have been unable to lower total mortality; no statin trial has succeeded in lowering mortality in women, elderly people, or diabetics; and cholesterol-lowering with statins has been associated with many serious side effects.
  5. Eating fat with potatoes is probably a bad idea if you are hoping to lose weight.

Science and Technology links (January 9th 2021)

  1. The Earth is spinning faster and faster: the 28 fastest days on record (since 1960) all occurred in 2020, with Earth completing its rotation on its axis milliseconds quicker than average.
  2. We are soon getting a new Wi-Fi standard called Wi-Fi 6: it supports data transmission at over 1 GB/s, nearly three times the speed of previous standards.
  3. Eli Dourado predicts:

    By the middle of the decade, augmented reality will be widely deployed, in the same way that smart watches are today. Glasses will be computing devices. Every big tech company has a glasses project at a relatively mature stage in the lab today. The need for the glasses to understand context could result in much smarter digital assistants than today’s Siri, Alexa, and so on.

    I agree with Dourado.

  4. Vitamin D3 may reduce the risk of developing advanced cancer.
  5. High-tech running shoes improve elite marathon performance.
  6. In our brains, neurons are connected together by their dendrites. It seems that, at least in human beings, large dendrites are capable of computation on their own.
  7. A study of 145 journals in various fields, spanning 1.7 million authors, revealed that there was no obvious bias against authors with female names and, in fact, female authors might even be slightly favored by referees. Such a study does not show that sexism does not exist. It does not show that journals have never been biased against female authors.

Memory access on the Apple M1 processor

When a program is mostly just accessing memory randomly, a standard cost model is to count the number of distinct random accesses. The general idea is that memory access is much slower than most other computational tasks.

Furthermore, the cost model can be extended to count “nearby” memory accesses as free. That is, if I read a byte at memory address x and then I read a byte at memory address x+1, I can assume that the second byte comes “for free”.

This naive memory-access model is often sensible. However, you should always keep in mind that it is merely a model. A model can fail to predict real performance.

How might it fail? A CPU core can issue multiple memory requests at once. So if I need to access 7 memory locations at once, I can issue 7 memory requests and wait for them. It is likely that waiting for 7 memory requests is slower than waiting for a single memory request, but is it 7 times slower?
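The key distinction is between dependent and independent loads. A minimal sketch, with hypothetical functions for illustration: when each address depends on the previous load, the processor must wait for every load to complete; when the addresses are known upfront, it can issue several requests concurrently and overlap their latencies.

```c
#include <stddef.h>
#include <stdint.h>

// Dependent loads (pointer chasing): each load's address depends on
// the result of the previous load, so memory-level parallelism
// cannot help.
uint32_t chase(const uint32_t *next, size_t steps, uint32_t start) {
  uint32_t index = start;
  for (size_t i = 0; i < steps; i++) {
    index = next[index];
  }
  return index;
}

// Independent loads: all addresses are available in advance, so the
// processor may issue several memory requests at once.
uint64_t gather(const uint8_t *array, const uint32_t *indexes,
                size_t n) {
  uint64_t sum = 0;
  for (size_t i = 0; i < n; i++) {
    sum += array[indexes[i]];
  }
  return sum;
}
```

On a processor with ample memory-level parallelism, the gather loop can run several times faster per access than the chase loop, even though both do "one random access per iteration".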

The latest Apple laptop processor, the M1, has apparently a lot of memory-level parallelism. It looks like a single core has about 28 levels of memory parallelism, and possibly more.

Such a high degree of memory-level parallelism makes it less likely that our naive random-memory model applies.

To test it out, I designed the following benchmark where I compare three functions. The first one just grabs pairs of randomly selected bytes and it computes a bitwise XOR between them before adding them to a counter:

  for (size_t i = 0; i < 2*M; i += 2) {
    answer += array[random[i]] ^ array[random[i + 1]];
  }

We compare against a 3-wise version of this function:

  for (size_t i = 0; i < 3*M; i += 3) {
    answer += array[random[i]] ^ array[random[i + 1]]
              ^ array[random[i + 2]];
  }

Our naive memory-access cost model predicts that the second function should be 50% more expensive. However many other models (such as a simple instruction count) would also predict a 50% overhead.

To give our naive memory-access model a run for its money, let us throw in a 2-wise version that also accesses nearby values (with one-byte offset):

  for (size_t i = 0; i < 2*M; i += 2) {
    int idx1 = random[i];
    int idx2 = random[i + 1];
    answer += array[idx1] ^ array[idx1 + 1]
           ^ array[idx2]  ^ array[idx2 + 1];
  }

Our naive memory-access cost model would predict that the first and last functions should have about the same running time while the second function should be 50% more expensive.

Let us measure it out. I use a 1 GB array and I report the average time spent in nanoseconds on each iteration.

2-wise      8.9 ns
3-wise     13.0 ns
2-wise +   12.5 ns

At first glance, our naive memory-access model is validated: the 3-wise function is 46% more expensive than the 2-wise function. Yet we should not be surprised because most reasonable models would make such a prediction since in almost every way, the function does 50% more work.

It is more interesting to compare the two 2-wise functions… the last one is 40% more expensive than the first. It contradicts our prediction. And so, at least in this instance, our simple memory-access cost model fails us on the Apple M1 processor.


  1. My source code is available. The run-to-run variability is relatively high on such a test, but the conclusion is robust, on my Apple M1 system.
  2. I posted the assembly online.
  3. Importantly, I do not predict that other systems will follow the same pattern. Please do not run this benchmark on your non-M1 PC and expect comparable results.
  4. This benchmark is meant to be run on an Apple MacBook with the M1 processor, compiled with Apple’s clang compiler. It is not meant to be used on other systems.

Peer-reviewed papers are getting increasingly boring

The number of researchers and peer-review publications is growing exponentially.  It has been estimated that the number of researchers in the world doubles every 16 years and the number of research outputs is increasing even faster.

If you accept that published research papers are an accurate measure of our scientific output, then we should be quite happy. However, Cowen and Southwood take an opposing point of view and represent this growth as a growing cost without associated gains.

(…) scientific inputs are being produced at a high and increasing rate, (…) It is a mistake, however, to infer that increases in these inputs are necessarily good news for progress in science (…) higher inputs are not valuable per se, but instead they are a measure of cost, namely how much is being invested in scientific activity. The higher the inputs, or the steeper the advance in investment, presumably we might expect to see progress in science be all the more impressive. If not, then perhaps we should be worried all the more.

So are these research papers that we are producing in greater numbers… the kind of research papers that represent real progress? Bhattacharya and Packalen conclude that though we produce more papers, science itself is stagnating because of worsening incentives that focus research on low-risk/no-reward ventures as opposed to genuine progress:

This emphasis on citations in the measurement of scientific productivity shifted scientist rewards and behavior on the margin toward incremental science and away from exploratory projects that are more likely to fail, but which are the fuel for future breakthroughs. As attention given to new ideas decreased, science stagnated.

Thurner et al. concur in the sense that they find that “out-of-the-box” papers are getting harder to find:

over the past decades the fraction of mainstream papers increases, the fraction of out-of-the-box decreases

Surely, the scientists themselves have incentives to course correct and encourage themselves to produce more important and exciting research papers?

Collison and Nielsen challenge scientists and institutions to tackle this perceived diminishing scientific productivity:

Most scientists strongly favor more research funding. They like to portray science in a positive light, emphasizing benefits and minimizing negatives. While understandable, the evidence is that science has slowed enormously per dollar or hour spent. That evidence demands a large-scale institutional response. It should be a major subject in public policy, and at grant agencies and universities. Better understanding the cause of this phenomenon is important, and identifying ways to reverse it is one of the greatest opportunities to improve our future.

If we believe that research papers are becoming worse, that fewer of them convey important information, then the rational approach is to downplay them. Whenever you encounter a scientist and they tell you about how many papers they have published or where they were published, or how many citations they got… you should not mock the scientist in question, but you ought to bring the conversation to another level. What is the scientist working on and why is it important work? Dig below the surface.

Importantly, it does not mean that we should discourage people from publishing a lot of papers, any more than we generally discourage programmers from writing many lines of code. Everything else being equal, people who love what they are doing, and who are good at it, will do more of it. But nobody would mistake someone who writes a lot for a good writer if they aren't one.

We need to challenge the conventional peer-reviewed research paper, by which I refer to a publication that was reviewed by 2 to 5 peers before getting published. It is a relatively recent innovation that may not always be for the best. People like Einstein did not go through this process, at least not in their early years. Research used to be more like “blogging”: you would write up your ideas and share them. People could read them and criticize them. This communication process can be done through different means: some researchers broadcast their research meetings online.

Peer-reviewed research papers allow you to “measure” productivity. How many papers in top-tier venues did researcher X produce? And that is why the model grew so strong.

There is nothing wrong with people seeking recognition. Incentives are good. But we should reward people for the content of their research, not for the shallow metadata we can derive from their resume. If you have not read and used someone’s work, you have no business telling us whether they are good or bad.

The other related problem is the incestuous relationship between researchers and assessment. Is the work on theory X important? “Let us ask people who work on theory X.” No. You have to have customers, users, people who have incentives to provide honest assessments. A customer is someone who uses your research in an objective way. If you design a mathematical theory or a machine-learning algorithm and an investment banker relies on it, they are your customer (whether they are paying you or not). If it fails, they will stop using it.

It may seem like peer-reviewed research papers establish this kind of customer-vendor relationship where you get a frank assessment. Unfortunately, it fails as you scale it up. The true customers of a research paper are its independent readers, but peer reviewers are readers with motivations of their own.

You cannot easily “fake” customers. We do so sometimes, with movie critics, say. But movie critics have an incentive to give you recommendations you can trust.

We could try to emulate the movie critic model in science. I could start reviewing papers on my blog. I would have every incentive to be a good critic because, otherwise, my reputation might suffer. But it is an expensive process. Being a movie critic is a full time job. Being a research paper critic would also be a full time job.

What about citations? Well, citations are often granted by your nearest peers. If they are doing work that resembles yours, they have no incentive to take it down.

In conclusion, I do find it credible that science might be facing a sort of systemic stagnation brought forth by a set of poorly aligned incentives. The peer-reviewed paper accepted at a good venue as the ultimate metric seems to be at the core of the problem. Further, the whole web of assessment in modern science often seems broken. It seems that, on an individual basis, researchers ought to adopt the following principles:

  1. Seek objective feedback regarding the quality of your own work from “customers”: people who would tell you frankly if your work was not good. Do not mistake citations or “peer review” for such an assessment.
  2. When assessing another researcher, try your best to behave as a customer who has some distance from the research. Do not count inputs and outputs as a quality metric. Nobody would describe Stephen King as a great writer merely because he published many books. If you are telling me that Mr. Smith is a great researcher, then you should be able to tell me about the research and why it is important.


My Science and Technology review for 2020

    1. The original PlayStation game console (1994) was revolutionary thanks in part to its CD drive that could read data at an astonishing 0.3 MB/s. In 2020, the PlayStation 5 came out with 5 GB/s of disk bandwidth, over 15,000 times the bandwidth of the original PlayStation.
    2. The Samsung S10+ phone can be purchased with 1 TB of storage. It is enough storage to record everything you hear in daily life for ten years or to store everything you see for several weeks. It is relatively easy to buy a 4 TB SSD for your PC, but difficult to go much higher.
    3. Drones are used to keep Europeans in check during the COVID 19 pandemic: they take people’s temperature and issue fines.
    4. In the state of New York, people can get married legally by videoconference.
    5. Virtually all kids and college students have taken online classes in 2020 in the developed world. It is now a widely held view (though not uncontested) that the future of colleges is online.
    6. UCLA researchers have achieved widespread rejuvenation in old mice through blood plasma dilution, a relatively simple process; they plan to conduct clinical trials in human beings “soon”. (Other reference.)

Science and Technology links (December 26th 2020)

  1. Researchers used a viral vector to manipulate eye cells genetically to improve the vision of human beings.
  2. Seemingly independently, researchers have reported significant progress regarding the solution of the Schrödinger equation using deep learning: Puppin et al., Hermann et al.
  3. The Dunning-Kruger Effect Is Probably Not Real. I am becoming quite upset with respect to the many effects in psychology that fail to be independently verified. And I’d feel better if it were only a problem in psychology.
  4. Can the technology behind COVID-19 vaccines lead to other breakthroughs?