Modern game controllers can point in a wide range of directions. Game designers sometimes want to convert the joystick direction to get 8-directional movement. A typical solution offered is to compute the angle, round it up and then compute back the direction vector.

double angle = atan2(y, x); angle = (int(round(4 * angle / PI + 8)) % 8) * PI / 4; xout = cos(angle); yout = sin(angle);

If you assume that the unit direction vector is in the first quadrant (both x and y are positive), then there is a direct way to compute the solution. Using 1/sqrt(2) or 0.7071 as the default solution, compare both x and y with sin(3*pi/8), and only switch them to 1 if they are larger than sin(3*pi/8) or to 0 if the other coordinate is larger than sin(3*pi/8). The full code looks as follows:

double outx = 0.7071067811865475; // 1/sqrt(2) double outy = 0.7071067811865475;// 1/sqrt(2) if (x >= 0.923879532511286) { // sin(3*pi/8) outx = 1; } if (y >= 0.923879532511286) { // sin(3*pi/8) outy = 1; } if (y >= 0.923879532511286) { // sin(3*pi/8) outx = 0; } if (x >= 0.923879532511286) { // sin(3*pi/8) outy = 0; } if (xneg) { outx = -outx; }

I write tiny *if* clauses because I hope that the compile will avoid producing comparisons and jumps which may stress the branch predictor when the branches are hard to predict.

You can generalize the solution for the case where either x or y (or both) are negative by first taking the absolute value, and then restoring the sign at the end:

bool xneg = x < 0; bool yneg = y < 0; if (xneg) { x = -x; } if (yneg) { y = -y; } double outx = 0.7071067811865475; // 1/sqrt(2) double outy = 0.7071067811865475;// 1/sqrt(2) if (x >= 0.923879532511286) { // sin(3*pi/8) outx = 1; } if (y >= 0.923879532511286) { // sin(3*pi/8) outy = 1; } if (y >= 0.923879532511286) { // sin(3*pi/8) outx = 0; } if (x >= 0.923879532511286) { // sin(3*pi/8) outy = 0; } if (xneg) { outx = -outx; } if (yneg) { outy = -outy; }

You can rewrite everything with the ternary operator to entice the compiler to produce branchless code (i.e., code without jumps). The result is more compact.

bool xneg = x < 0; x = xneg ? -x : x; bool yneg = y < 0; y = yneg ? -y : y; double outx = (x >= 0.923879532511286) ? 1 : 0.7071067811865475; double outy = (y >= 0.923879532511286) ? 1 : 0.7071067811865475; outx = (y >= 0.923879532511286) ? 0 : outx; outy = (x >= 0.923879532511286) ? 0 : outy; outx = xneg ? -outx : outx; outy = yneg ? -outy : outy;

The clang compiler may produce an entirely branchless assembly given this code.

But as pointed out by Samuel Lee in the comments, you can do even better… Instead of capturing the sign with a separate variable, you can just copy the pre-existing sign using a function like copysign (available in C, C#, Java and so forth):

double outx = fabs(x); double outy = fabs(y); outx = (outx >= 0.923879532511286) ? 1 : 0.7071067811865475; outy = (outy >= 0.923879532511286) ? 1 : 0.7071067811865475; outx = (posy >= 0.923879532511286) ? 0 : outx; outy = (posx >= 0.923879532511286) ? 0 : outy; outx = copysign(outx, x); outy = copysign(outy, y);

I wrote a small benchmark that operates on random inputs. Your results will vary but on my mac laptop with LLVM 12, I get that the direct approach with copysign is 50 times faster than the approach with tan/sin/cos.

with tangent | 40 ns/vector |

fast approach | 1.2 ns/vector |

fast approach/copysign | 0.8 ns/vector |

1. Cos(atan(x)) has a nice identity as 1/sqrt(x^2+1), while sin is x/sqrt(x^2+1). What about implementing it that way? In this way you have a branch less solution which can be vectorized (although your scenario suggests there’s only one vector).

2. it seems to me that you don’t normalize the input vector. is the input circle in the unit disk? What happens if both x and y are lower than cos(3pi/8) then the vector will be zero, and you have the dead zone dependent on your rouding. If you assume the unit disk, can some of the branches be filded together in that scenario?

In this way you have a branch less solution which can be vectorizedMy version can be compiled to branchless code and is vectorizable.

it seems to me that you don’t normalize the input vector.We are assuming here that the direction vector has been normalized (it is a ‘unit’ direction vector).

If you assume the unit disk, can some of the branches be filded together in that scenario?You can rewrite the the logic to have fewer apparent branches, but all my branches are subject to becoming mere selection (condition moves).

I certainly do not claim that my code is the most efficient possible. In fact, I am sure you can do better !!!

The original code could avoid both sin and cos by computing as value between 0 and 7 and a switch with hard-coded values. Your code would still be faster vy avoiding atan2, but by a lesser factor.

Remove the if, instead use a ternary operator and then we get branchless code when using clang.

Funny that Clang only successfully goes branchless when using ternary operator instead of if.

https://godbolt.org/z/so5hhqvMo

Correct.

I am sorry, I didn’t get it. Why do you need four comparisons in the first snippet? I think two are enough.

Why do you need four comparisons in the first snippet? I think two are enough.I do not need four, I use four. You are correct that two are sufficient. However, we want to entice the compiler to produce branchless code.

Branchless code may not be faster in absolute terms, but it is has consistent (fixed) performance.

Putting your ‘branch’ results into a two element vector, and casting the (single evaluation) of the comparison to bool() as an index, is also a technique used to more forcibly result in branch-less coding, with a little less reliance on the compiler. Almost certainly not a issue with this small, localised, example but may be useful when you truly want to remove branches.

At least under GCC, the compiled result may be that you move the two values to the stack and then load back the desired value from the stack. It is significantly more inefficient that a conditional move.

Heads up that in the latest version of your code/the post you seem to have lost the sin(PI/8) constant.

A had a couple of thoughts on this just getting around to tidying.

1) Handling the making the doubles positive for comparison before making putting the sign back can be achieved more efficiently with the use of fabs and copysign:

https://godbolt.org/z/aebx9b48a

This should clearly work and give a modest uplift.

2) You actually don’t need to be using floating point at all for this, and possibly can get much better speed using uint64_ts to perform the logic. You can even avoid using flags 🙂 :

https://godbolt.org/z/Yxq3o7hnc

I have not tested this code at all, but I think it should work (assuming you have well-formed inputs – obviously doesn’t handle NaNs for example).

Ah, just read the new code more closely, and understand the change to remove the sin(PI/8) constant – nice! The same could apply to my suggestions

Congratulations, you have the fastest approach. It is under 1 ns on my laptop! I estimate it is about 2.5 cycles per conversion…

I agree that you can almost surely do it without using floats but it could be going too far down the rabbit hole.