BTW, GCC 8.1 seems to generate:

```asm
compilermod32(unsigned int):
        mov     eax, edi
        mov     edx, -1171354717
        mul     edx
        mov     eax, edx
        shr     eax, 4
        imul    eax, eax, 22
        sub     edi, eax
        mov     eax, edi
        ret
```

whereas GCC trunk generates one less instruction:

```asm
compilermod32(unsigned int):
        mov     eax, edi
        mov     edx, 3123612579
        imul    rax, rdx
        shr     rax, 36
        imul    eax, eax, 22
        sub     edi, eax
        mov     eax, edi
        ret
```
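For reference, both sequences implement the classic quotient-first computation with a fixed-point reciprocal: the magic constant 3123612579 (which is what -1171354717 reads as unsigned) equals ceil(2^36 / 22), and GCC 8.1 merely splits the 36-bit shift into a high-half multiply plus `shr eax, 4`. A C sketch of what the trunk version computes (the function name `mod22` is mine):

```c
#include <stdint.h>

/* Quotient-first remainder by 22: multiply by the fixed-point
 * reciprocal M = ceil(2^36 / 22) = 3123612579, shift right by 36
 * to obtain the quotient, then subtract q * 22 -- mirroring the
 * trunk assembly above. Correct for all 32-bit n because
 * 22 * M - 2^36 = 2 <= 2^(36-32). */
uint32_t mod22(uint32_t n) {
    uint64_t q = ((uint64_t)n * 3123612579u) >> 36;  /* q = n / 22 */
    return n - (uint32_t)(q * 22);                   /* r = n - q*22 */
}
```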

PS: And please, I certainly don’t mean to tell you what to do, but since you mentioned last time that you have a Cannon Lake machine nearby and I don’t, I think it would be interesting to run this comparison on it, too.

PPS: Another common case that would benefit from fast remainders is modular multiplication, i.e. `a*b%c` (all 64-bit unsigned ints, `c` a “runtime” constant); the current way to approach it is to use Montgomery multiplication. Just saying…
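For what it’s worth, on compilers with 128-bit integer support the baseline for `a*b%c` is a widening multiply followed by a 128-by-64 reduction; Montgomery multiplication pays off when the same `c` is reused across many multiplications. A minimal baseline sketch (assuming `unsigned __int128`, available in GCC/Clang):

```c
#include <stdint.h>

/* Baseline 64-bit modular multiplication: widen to 128 bits, then
 * reduce. This is the operation Montgomery multiplication
 * accelerates when c is a runtime constant reused many times. */
uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t c) {
    return (uint64_t)(((unsigned __int128)a * b) % c);
}
```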

Not directly comparable to Pythonista.

Floating point remainders are probably another story entirely.

https://ish.app

I cannot dictate what is important or interesting to you. But maybe you agree that more than one thing can be interesting at once.

As far as we know, nobody had worked out the mathematics for the direct computation of the remainder; authors simply ignored the issue. Completing the mathematical framework was important, I think. A direct application of this “new math” is the divisibility test, which I think is really nice.
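As an illustration of that divisibility test: with c = ceil(2^64 / d) precomputed, a 32-bit n is divisible by d exactly when n*c, taken modulo 2^64 (the fixed-point representation of frac(n/d)), is at most c − 1. A sketch under those assumptions (function names are mine):

```c
#include <stdint.h>
#include <stdbool.h>

/* Precompute c = ceil(2^64 / d) for a divisor d >= 2. */
uint64_t precompute_c(uint32_t d) {
    return UINT64_MAX / d + 1;
}

/* n is divisible by d exactly when the 64-bit fixed-point
 * fractional part of n/d, namely n*c mod 2^64, is below c. */
bool is_divisible(uint32_t n, uint64_t c) {
    return n * c <= c - 1;  /* product wraps modulo 2^64 */
}
```

No multiply-back by d is needed, which is what makes the test cheaper than computing the remainder and comparing it with zero.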

*In fact, I claim that the majority of the improvement in the assembly you show is due to the use of the wider operations and more precise reciprocal, rather than the more “direct” calculation method.*

It might be. It is certainly worth exploring. We published our benchmarking code (see link above) and we include in these benchmarks an approach that computes the remainder with a fast “wide” division to get the quotient first. It was often faster than other alternatives, but also generally slower than the direct approach. That does not contradict your “majority” claim… but few people are interested in the second-best approach when the best approach is no more expensive. Because it was unexciting, it is not reported in the published paper even if it appears in our logs and public code.
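To make that comparison concrete, here is a sketch of the two alternatives for 32-bit n with a 64-bit reciprocal c = ceil(2^64 / d) (assuming `unsigned __int128` support; the naming is mine): the direct method multiplies the fractional bits by d, while the quotient-first method recovers the quotient and subtracts.

```c
#include <stdint.h>

/* c = ceil(2^64 / d), precomputed once for a runtime-constant d >= 2. */
uint64_t precompute_c(uint32_t d) { return UINT64_MAX / d + 1; }

/* Direct: n*c mod 2^64 holds frac(n/d) in 64-bit fixed point;
 * multiplying by d and keeping the top 64 bits yields the remainder. */
uint32_t mod_direct(uint32_t n, uint64_t c, uint32_t d) {
    uint64_t lowbits = c * n;
    return (uint32_t)(((unsigned __int128)lowbits * d) >> 64);
}

/* Quotient-first: recover q = floor(n/d) from the high 64 bits of
 * the product, then compute n - q*d. */
uint32_t mod_via_quotient(uint32_t n, uint64_t c, uint32_t d) {
    uint64_t q = (uint64_t)(((unsigned __int128)c * n) >> 64);
    return n - (uint32_t)(q * d);
}
```

Both need one widening multiply by c; the direct version then does one more multiply, while the quotient-first version does a multiply and a subtraction.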

*The mathematics can be presented as an extension of GM’s approach, and it is just as general.*

The basic mathematics is general, yes – I think everyone agrees that the remainder can be calculated as `frac(n / d) * d`, and in the abstract that’s just as general as `n - trunc(n / d) * d`.

However, the earlier work (and this one) primarily focus on the *efficient* implementation of these methods on real hardware. That’s the interesting part, if I’m not mistaken?

In that domain, the general target is to implement W-bit division on W-bit machines, and much of the analytical complexity and much of the runtime cost come from the difficulty of doing that.

So if you are going to implement only W-bit operations assuming that your machine can do 2W-bit arithmetic, I would certainly consider that less general – and comparing W-bit approaches against 2W-bit approaches doesn’t shed much light on the performance of the underlying techniques since it mixes the two effects together. In fact, I claim that the majority of the improvement in the assembly you show is due to the use of the wider operations and more precise reciprocal, rather than the more “direct” calculation method.

*A commenter on this blog post has reported an application of the mathematical result to 64-bit unsigned division on 64-bit processors.*

I didn’t look at it, but if it is true – and it is as efficient as the narrow-input case you illustrate – then my point is basically moot (although it of course still applies to the paper itself, which doesn’t mention this).

And I think it’s rooted in something that people consider Apple hype, but it’s real: namely, making items in the interface behave like physical objects. The early version of that, skeuomorphism, was about looks, and of limited success. Today’s version is about behavior, not appearance, and I think it is hugely successful. It’s the fluid motion, the little bounces, the gravity attraction, that creates the delight.

To the extent that this could be quantified (60 fps, no skipped frames), it was accepted by tech nerds as a real “goal” and aspired to/argued over. But that’s old hat; what matters now is the pseudo-physics, which can’t be quantified, and so it’s dismissed. But I think it’s what makes it all work.

On the Mac I’ve had the same experience. The HW has (finally) become fast enough to maintain the fluidity, and enough of the system SW has picked up the pseudo-physics, and so IMHO MacOS has a lot more of this feel of delight than it did a few years ago, even though superficially it looks very similar.

Instead of

`(-n) mod d = -(n mod d)`

we have

`(-n) mod d = d - (n mod d)`

(for `n mod d ≠ 0`).
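In C terms, `%` truncates toward zero and so implements the first convention; a small adjustment yields the second (floored, mathematical) one. A sketch:

```c
/* C's % truncates toward zero, so (-n) % d == -(n % d).
 * The floored (mathematical) modulus instead gives
 * d - (n % d) for a negative dividend with nonzero remainder. */
int floored_mod(int a, int d) {
    int r = a % d;            /* truncated remainder, may be negative */
    return r < 0 ? r + d : r; /* shift into [0, d) for d > 0 */
}
```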

I’m also curious if there are any newer results on floating point remainder (for my application specifically, a floating point numerator and integer divisor).
