Perhaps some types of transitions take longer, however, e.g., if the voltage needs to change.

There is also another type of transition where the frequency doesn’t change but the “upper lanes” of the ALUs are powered up, which occurs on some chips if you don’t run 256-bit instructions for a while and then run one. Agner describes it at the end of this comment on his blog. This transition is also in the “microseconds”, not “milliseconds”, range.

]]>I.e., one way to extend the multiply-shift scheme to larger words is to decompose it explicitly into smaller hashes and concatenate the hashes together to get your word-sized hash; another way is to just perform the multiply-shift formula once at the full word size and rely on multi-precision arithmetic to calculate the answer when the needed arithmetic exceeds the machine’s word size.

They end up scaling in about the same way: both are quadratic in the word size (at least as I understand Daniel’s approach), and in fact they ultimately produce similar operations.

The explicit decomposition approach has the advantage that you can perhaps rely on some knowledge of the required result to eliminate or combine operations. For example, Daniel’s approach only uses 4 multiplications while a naive 64*128->128 multiplication would need 8 (I think). Essentially the decomposition was able to work around the lack of a way to get the high-half of a 64*64-bit multiplication in Java.
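To illustrate the kind of decomposition involved (this is the general textbook trick, not Daniel’s actual code): when no 64×64→128 primitive is available, as in Java before `Math.multiplyHigh`, the high half of a 64×64-bit product has to be assembled from four 32×32→64 multiplies. A minimal C sketch:

```c
#include <stdint.h>

/* Sketch: the high 64 bits of a 64x64-bit product, computed from four
 * 32x32->64 multiplies -- the decomposition forced on you when there is
 * no way to get the high half of a 64x64 multiplication directly. */
uint64_t mulhi64(uint64_t x, uint64_t y) {
    uint64_t x_lo = (uint32_t)x, x_hi = x >> 32;
    uint64_t y_lo = (uint32_t)y, y_hi = y >> 32;
    uint64_t ll = x_lo * y_lo;
    uint64_t lh = x_lo * y_hi;
    uint64_t hl = x_hi * y_lo;
    uint64_t hh = x_hi * y_hi;
    /* middle column, collecting the carry out of the low 64 bits */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```

Four multiplies just to recover one high half is exactly why knowing the final result is truncated (so partial products can be dropped or combined) pays off in the decomposition approach.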

The multi-precision arithmetic approach has the advantage of being able to lean on existing multi-precision libraries[1], which may end up being faster (i.e., if they know how to get a 64*64->128-bit result, you cut the multiplications down to 2 as in my suggestion above), and being obviously correct without proof since they use the original multiply-shift-add formula directly.

[1] Of course if you actually use some big heavy multi-precision library maybe that’s also a downside.

]]>
> By the time you get to Gold, you can have several cores running AVX-512 instructions before the core frequency drops below base frequency. On Platinum you can have more than half the cores running AVX instructions and not see an impact.

I think this part embeds a wrong assumption. Yes, you can have several cores running AVX-512 without seeing a drop below *base* frequency, but there is nothing really special about base frequency: it’s just the number on the box, and Intel makes some loose guarantees about it, but it is almost irrelevant for most code.

Most cores are going to be running at “turbo” frequencies most of the time, including “AVX-512 turbo”, which may be above or below base. The logical comparison is between the AVX-512 turbo and the scalar (non-AVX) turbo, not between AVX turbo and base. Or more precisely, the right comparison is between the actual frequencies for AVX-using and non-AVX code, and the turbo frequencies are good proxies for those.

So regarding: “On Platinum you can have more than half the cores running AVX instructions **and not see an impact.**” – the bolded part is not true: almost all chips will suffer a frequency impact relative to the scalar case as soon as they run some heavy AVX-256 or any AVX-512. Only relative to the arbitrary base frequency is there no “impact”.

Not being an all-seeing oracle isn’t a “bug”. Despite lazy programmers’ desire to “leave it to the hardware”, manual optimization will always be a thing.

]]>Given that, could you calculate your 64-bit strongly universal hash more directly, using only two multiplications if you had access to the full 128-bit result of a 64 * 64 multiplication?

If I understand it correctly, you are using the multiply-shift scheme to calculate an N-bit hash of an N-bit input x using: (a*x + b) >> N, where a and b are 2N-bit values and x is N-bit.

So given N = 64, on most hardware we can do a 64 * 64 -> 128 bit multiplication, but for the a * x part we actually need 128 * 64 -> 128, which could be implemented with two 64 * 64 -> 128 multiplies and a 64-bit add. Then you have two more adds for the + b part, although in principle it seems like adding the bottom half of b is mostly pointless because it only very weakly affects the actual result since the bottom bits are all thrown away [1]. The shift >> 64 is a no-op, since we’re splitting the 128-bit result into two 64-bit registers, so we just directly use the top half.

So you end up with half the number of multiplies and fewer additions as well.

The trick, of course, is convincing the language of your choice to generate sensible code under the covers that actually uses the hardware capabilities! From C or C++ this is somewhat easier due to the presence of 128-bit types on most compilers. I’m not sure about Java though.
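As a sketch of the above in C, assuming a compiler with `unsigned __int128` (the constants and names here are placeholders, not the actual hash parameters):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Multiply-shift sketch: h(x) = ((a*x + b) mod 2^128) >> 64, with a and b
 * 128-bit constants (passed as 64-bit halves) and x a 64-bit key.
 * The 128x64 product mod 2^128 needs one full 64x64->128 multiply for the
 * low half of a, plus the low 64 bits of a_hi * x shifted into the top. */
static uint64_t hash64(uint64_t x, uint64_t a_hi, uint64_t a_lo,
                       uint64_t b_hi, uint64_t b_lo) {
    u128 acc = (u128)a_lo * x;           /* full 128-bit partial product   */
    acc += (u128)(a_hi * x) << 64;       /* only low 64 bits of a_hi*x
                                            survive the mod 2^128          */
    acc += ((u128)b_hi << 64) | b_lo;    /* the b_lo add barely matters,
                                            per footnote [1] below         */
    return (uint64_t)(acc >> 64);        /* the >>64 is just "take the top
                                            64-bit register"               */
}
```

So in total: two multiplies (one of which only needs its low half) plus a few adds, versus the larger count forced by an explicit 32-bit decomposition.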

[1] Of course, you’d have to walk through the strong universality proof to see if dropping the low-bit addition breaks the guarantee in theory. Perhaps it ends up being strongly 1.0000001-universal or something.

]]>“if you knew that h(x) and h(x’) could not collide”

This is what I noticed: for x’ = x + dx, where dx is a small integer, there are ZERO (32-bit) collisions, whereas murmur and random produce ~the same number of collisions.

Code:

https://www.jdoodle.com/online-java-compiler#&togetherjs=01tRdC7ssr ]]>

> On the low end Bronze chips, just having 1 single core running an AVX-512 instruction is enough to drop the base frequency of the chip down to 800Mhz.

You probably understand this, but to clarify for others: “chip” here means just that particular core, not all the cores on the CPU. The belief (likely correct) is that if you use a single “heavy” AVX-512 instruction (such as a 512-bit multiplication), that particular core will momentarily be slowed down to 800 MHz. Do we know how long “momentarily” is here, and what the transition penalty is?

Thus if you are performing a task that is already at maximum IPC, you would expect a greater than 50% slowdown. On the other hand, if you are already slowed down by memory accesses, you might not notice anything even on Xeon Bronze. So while it should be possible to come up with a benchmark that shows the full impact, it might not be easy to come up with one that doesn’t feel “artificial”.

]]>So the theory is that my benchmark simply does not use enough cores. But I already ran Vlad’s multicore benchmark on this same hardware and saw no effect.

I am inviting reproducible benchmarks, I will gladly run them.

]]>