Wow, thanks Chris, that's something of an education. I'm very surprised by those metrics.

I'll run through the reasoning that led me to expect totally different results. There might be a bug in the following pseudo machine code but I'm sure you'll get the drift.

I would expect the if version to work something like this...

```
load value
subtract 1
branch to A if sign negative
load 1
branch to Exit
A:
load value
add 1
branch to B if sign negative
load value
branch to Exit
B:
load -1
Exit:
```

So if value >= 1 this would take 5 cycles

if value < -1 this would take 7 cycles

if value < 1 and value >= -1 this would take 8 cycles
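
For reference, here is roughly the source I imagine compiling down to the pseudo code above. It is only a sketch: the name `clamp_if` is my own invention, and I'm assuming the limits are -1 and 1 as in the pseudo code.

```c
/* Hypothetical name; clamps value to the range [-1, 1] using two tests. */
double clamp_if(double value)
{
    if (value >= 1.0) {
        return 1.0;      /* first test fails to branch: the short path */
    }
    if (value < -1.0) {
        return -1.0;     /* both tests branch: load the lower limit */
    }
    return value;        /* in range: pass the value through unchanged */
}
```

Calling `clamp_if(2.0)` takes the short path, while anything below 1 has to go through both tests.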

While for the obscure version I would expect something like the following...

```
load register 1 with value
add 1 to register 1
branch to A if register 1 positive
negate register 1
A:
load register 2 with value
subtract 1 from register 2
branch to B if register 2 positive
negate register 2
B:
subtract register 2 from register 1
multiply register 1 by 0.5
```

So if value >= 1 this would take 8 cycles

if value <= -1 this would take 10 cycles

if value < 1 and value > -1 this would take 9 cycles

So more cycles, and one would also expect a double precision multiplication to take longer to execute than the other "virtual cycles" I'm using in this model.
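
In C, the obscure version I'm modelling would presumably look something like the sketch below. This is my reading of the pseudo code rather than the exact code under test: the name `clamp_abs` is made up, and the limits of -1 and 1 are assumed.

```c
#include <math.h>

/* Hypothetical name; clamps value to [-1, 1] with no explicit if. */
double clamp_abs(double value)
{
    /* For value >= 1 the difference is (value + 1) - (value - 1) = 2.
       For value <= -1 it is -(value + 1) + (value - 1) = -2.
       In between it is (value + 1) - (1 - value) = 2 * value.
       Halving gives 1, -1 and value respectively. */
    return 0.5 * (fabs(value + 1.0) - fabs(value - 1.0));
}
```

So `clamp_abs(2.0)` gives 1, `clamp_abs(-3.0)` gives -1, and `clamp_abs(0.25)` gives 0.25, the same results as the if version.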

Trying to analyse why I'm so wrong, I can see that the abs() probably doesn't use a branch these days as it'll be a hardware function in modern FPUs and will probably execute much faster than in the old days.
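
To illustrate why abs() needn't involve a branch: assuming IEEE 754 doubles, the sign is a single bit, so clearing it yields the absolute value. The sketch below shows the idea in portable C. The name `abs_by_mask` is hypothetical, and I'm not claiming this is what any particular compiler or FPU actually does.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper; assumes double is a 64-bit IEEE 754 value. */
double abs_by_mask(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the double as raw bits */
    bits &= ~(UINT64_C(1) << 63);     /* clear the sign bit (bit 63) */
    memcpy(&x, &bits, sizeof x);      /* copy the bits back into the double */
    return x;
}
```

There is no comparison and no jump anywhere in that function, just a mask.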

But I'm still really surprised by those metrics. Instruction caching should make such small amounts of code work smoothly from an instruction fetch POV. I guess I'm underestimating the impact of branching on pipelining. I wonder if the random data you are using makes any difference to how well the branch prediction works. More coherent/monotonic data might have different results but I doubt it would turn around that surprising 20:1 result.
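
One crude way to gauge how branch-friendly a data set is would be to count how often the first test (value < 1) changes outcome from one element to the next. This is only a rough proxy of my own, since real branch predictors are far more sophisticated, and `count_branch_flips` is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical helper: counts how many times the outcome of the test
   (value < 1) differs from the outcome for the previous element. */
size_t count_branch_flips(const double *data, size_t n)
{
    size_t flips = 0;
    for (size_t i = 1; i < n; ++i) {
        int prev = data[i - 1] < 1.0;
        int curr = data[i] < 1.0;
        if (prev != curr) {
            ++flips;
        }
    }
    return flips;
}
```

Monotonic data would flip at most once across the whole array, whereas random data spanning both sides of 1 would flip frequently, which is where I'd expect any misprediction cost to show up.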

Just as a matter of interest what CPU did you run that test on?

Anyway, thanks for opening my eyes.