It is easier for an optimizing compiler to spot and eliminate redundant operations if it can operate over a large block of code. Nevertheless, it is still recommended to stick with small functions. Small functions are easier to read and debug. Furthermore, you can often rely on the compiler smartly inline your small functions inside larger functions. Furthermore, if you have a just-in-time compiler (e.g., in C# and Java), the compiler may often focus its energy on functions that are called more often. Thus small functions are more likely to get optimized.
Nevertheless, you sometimes want to manually inline your code. Even the smartest compilers get it wrong. Furthermore, some optimizing compilers are simply less aggressive than others.
We have been working on a fast float-parsing library in C# called csFastFloat. Though it is already several times faster than the standard library, the primary author (Verret) wanted to boost the performance further by using SIMD instructions. SIMD instructions are fancy instructions that allow you to process multiple words at once, unlike regular instructions. The C# language allows you to use SIMD instructions to speed up your code.
He encountered a case where using SIMD instructions failed to bring the exciting new performance he was expecting. The easy way out in the end was to “manually inline” the functions. I am going to review a simplified instance because I believe it is worth discussing.
I am not going to review it in detail, but the following C# function allows you to quickly determine if 16 bytes are made of ASCII digits. In ASCII, the character ‘0’ has value 48 whereas the character ‘9’ has value 57. The function builds two “vectors” made of the values 47 and 58 (repeated 16 times) and then it does 16 comparisons at once (with CompareGreaterThan and then CompareLessThan). It concludes with a couple of instructions to check whether all conditions are met to have 16 digits.
unsafe static bool is_made_of_sixteen_digits(byte* chars) { Vector128<sbyte> ascii0 = Vector128.Create((sbyte)47); Vector128<sbyte> after_ascii9 = Vector128.Create((sbyte)58); Vector128<sbyte> raw = Sse41.LoadDquVector128((sbyte*)chars); var a = Sse2.CompareGreaterThan(raw, ascii0); var b = Sse2.CompareLessThan(raw, after_ascii9); var c = Sse2.Subtract(a, b); // this is not optimal return (Sse41.TestZ(c,c)); }
Let us write a function that tries to determine if I have a string that begins with 16 digits, or 32 digits. It is a made up function that I just use for illustration purposes.
// return 2 if 32 digits are found // return 1 if 16 digits are found // otherwise return 1 unsafe static int ParseNumberString(byte* p, byte* pend) { if ((p + 16 <= pend) && is_made_of_sixteen_digits(p)) { if((p + 32 <= pend) && is_made_of_sixteen_digits(p + 16)) { return 2; } return 1; } return 0; }
If you just write that out, C# might fail to inline the is_made_of_sixteen_digits function. Thankfully, you can tell C# that you do want it to inline the child function by labelling it with an “AggressiveInlining” attribute.
So far so good, right?
Let us look at the assembly output to see how it gets compiled. Do not worry, I do not expect you to read this gibberish. Just look at the lines that start with an arrow (“->”).
<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*) L0000: push ebp L0001: mov ebp, esp L0003: vzeroupper L0006: lea eax, [ecx+0x10] L0009: cmp eax, edx L000b: ja short L0071 -> L000d: vmovupd xmm0, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)] -> L0015: vmovupd xmm1, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)] L001d: vlddqu xmm2, [ecx] L0021: vpcmpgtb xmm0, xmm2, xmm0 L0025: vpcmpgtb xmm1, xmm1, xmm2 L0029: vpsubb xmm0, xmm0, xmm1 L002d: vptest xmm0, xmm0 L0032: jne short L0071 L0034: lea eax, [ecx+0x20] L0037: cmp eax, edx L0039: ja short L006a -> L003b: vmovupd xmm0, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)] -> L0043: vmovupd xmm1, [<Program>$.<<Main>$>g__ParseNumberString|0_1(Byte*, Byte*)] L004b: vlddqu xmm2, [ecx+0x10] L0050: vpcmpgtb xmm0, xmm2, xmm0 L0054: vpcmpgtb xmm1, xmm1, xmm2 L0058: vpsubb xmm0, xmm0, xmm1 L005c: vptest xmm0, xmm0 L0061: jne short L006a L0063: mov eax, 2 L0068: pop ebp L0069: ret L006a: mov eax, 1 L006f: pop ebp L0070: ret L0071: xor eax, eax L0073: pop ebp L0074: ret
What C# does is to create the “vectors” made of the values 47 and 58 twice each (for a total of four times).
There might be a clever way to get C# to stop being inefficient, but you can also just manually inline. That is you create one function that includes the other function. The result might look at follows:
// return 2 if 32 digits are found // return 1 if 16 digits are found // otherwise return 1 unsafe static int ParseNumberStringInline(byte* p, byte* pend) { if (p + 16 <= pend) { Vector128<sbyte> ascii0 = Vector128.Create((sbyte)47); Vector128<sbyte> after_ascii9 = Vector128.Create((sbyte)58); Vector128<sbyte> raw = Sse41.LoadDquVector128((sbyte*)p); var a = Sse2.CompareGreaterThan(raw, ascii0); var b = Sse2.CompareLessThan(raw, after_ascii9); var c = Sse2.Subtract(a, b); if((p + 32 <= pend) && Sse41.TestZ(c,c)){ raw = Sse41.LoadDquVector128((sbyte*)p + 16); a = Sse2.CompareGreaterThan(raw, ascii0); b = Sse2.CompareLessThan(raw, after_ascii9); c = Sse2.Subtract(a, b); if(Sse41.TestZ(c,c)) { return 2; } } return 1; } return 0; }
This new code is harder to read and maybe harder to maintain. However, let us look at the compiled output:
<Program>$.<<Main>$>g__ParseNumberStringInline|0_2(Byte*, Byte*) L0000: push ebp L0001: mov ebp, esp L0003: vzeroupper L0006: lea eax, [ecx+0x10] L0009: cmp eax, edx L000b: ja short L0061 L000d: vmovupd xmm0, [<Program>$.<<Main>$>g__ParseNumberStringInline|0_2(Byte*, Byte*)] L0015: vmovupd xmm1, [<Program>$.<<Main>$>g__ParseNumberStringInline|0_2(Byte*, Byte*)] L001d: vlddqu xmm2, [ecx] L0021: vpcmpgtb xmm3, xmm2, xmm0 L0025: vpcmpgtb xmm2, xmm1, xmm2 L0029: vpsubb xmm4, xmm3, xmm2 L002d: lea eax, [ecx+0x20] L0030: cmp eax, edx L0032: ja short L005a L0034: vptest xmm4, xmm4 L0039: jne short L005a L003b: vlddqu xmm2, [ecx+0x10] L0040: vpcmpgtb xmm3, xmm2, xmm0 L0044: vpcmpgtb xmm2, xmm1, xmm2 L0048: vpsubb xmm4, xmm3, xmm2 L004c: vptest xmm4, xmm4 L0051: jne short L005a L0053: mov eax, 2 L0058: pop ebp L0059: ret L005a: mov eax, 1 L005f: pop ebp L0060: ret L0061: xor eax, eax L0063: pop ebp L0064: ret
I do not expect you to read this gibberish, but notice how the result is now tighter. It is also going to be slightly faster.
As a rule of thumb, if you look at the assembly output of your code, and it is shorter, it is usually the code that you will have better performance. It is simply the case that executing fewer instructions is often faster.
Though my example is a toy example, you should expect the csFastFloat library to benefit from SIMD instructions in the near future. The preliminary numbers I have seen suggest a nice performance bump. There is a pull request.
In early 2000s when I first got interested in compression algorithms for IR, I had found that a C++ compiler often refused to inline functions, which was detrimental to performance. So, I rewrote all these little functions as macros. Fun-fun. Now C++ compilers are more clever and this kinda of stuff is less useful (plus implementing everything using macros is difficult, you need to use some ugly hacks and things like ##), I guess it may still fail sometimes. And, of course, such failures are probably more common in C# and Java, which were always less geared towards efficiency compared to C++.
Interesting that it is generating 32 bit code instead of 64 bit Intel code.
As regular reader, I noticed recently, an erratic special character made your posts temporarily disappear from my handcrafted blog reader. I’ve waited abit for correction. Please fix using validation from https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Flemire.me%2Fblog%2Ffeed%2F Thank you.
The correct answer to “For software performance, can you always trust X” is “No”, for any value of X.
We have taken a note of this and opened https://github.com/dotnet/runtime/pull/61408 to fix the issue, but I believe the C# code itself to detect if there is ASCII or not can be simplified using MoveMask. Here is an example of how we do it :
https://github.com/dotnet/runtime/blob/e64cce6fbbde72d1ea61010bd08f11ba7b51afc9/src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs#L290-L297
The proposed function (in the blog post) aims to determine if 16 bytes are made of ASCII digits. I am also not claiming that it is optimal (or even correct).