The examples are fun, but rather than yet another article saying how amazing optimizing compilers are (they are, I already know), I'd probably benefit more from an article explaining when obvious optimizations are missed and what to do about it.
Some boring examples I've just thought of...
eg 1:
int bar(int num) { return num / 2; }
Doesn't get optimized to a single shift right, because that won't work if num is negative. In this case we can change the ints to unsigneds to tell the compiler we know the number isn't negative. But it isn't always easy to express to the compiler everything you know about your data and use case. There is an art in knowing what kinds of things you need to tell the compiler in order to unlock optimizations.
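For illustration, a minimal sketch of that type change (bar_u is just my name for this example):

/* With unsigned there is no negative input to handle, so the compiler
   can lower the division to a single logical shift right (shr). */
unsigned bar_u(unsigned num) { return num / 2; }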
eg 2:
We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6
eg 3:
#include <string.h>

int foo(char const *s) {
    if (strlen(s) < 3) return 0;
    if (strcmp(s, "hello") == 0)
        return 1;
    return 0;
}
This function returns 1 if s is "hello". 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it. https://godbolt.org/z/Koj65eo5K. I can think of many reasons the compiler isn't able to spot this.
I feel like it is unfair to blame the compiler when you've explicitly asked for `/O1`. If you change this to `/O2` or `/Ox` then MSVC will optimize this into a constant 5, proving that it does "know" that strlen will return 5 in this case.
bool is_divisible_by_6(int x) {
    return x % 2 == 0 && x % 3 == 0;
}

bool is_divisible_by_6_optimal(int x) {
    return x % 6 == 0;
}
Mathematically, x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int values, but the compiler doesn't see them as identical, and produces less optimal code for is_divisible_by_6 than for is_divisible_by_6_optimal.
Mhm, this is one of those cases where I'd prefer a benchmark to be sure. Checking % 2 is very cheap, really just a single bit test. I can also imagine some CPUs having a special code path for % 3. In practice I would not be surprised if the two separate checks were actually faster than the % 6. I'm on mobile at the moment, so I'm not able to verify.
is_divisible_by_6(int):
        test    dil, 1
        jne     .LBB0_1
        imul    eax, edi, -1431655765
        add     eax, 715827882
        cmp     eax, 1431655765
        setb    al
        ret
.LBB0_1:
        xor     eax, eax
        ret

is_divisible_by_6_optimal(int):
        imul    eax, edi, -1431655765
        add     eax, 715827882
        ror     eax
        cmp     eax, 715827883
        setb    al
        ret
By themselves, the mod 6 and mod 3 operations are almost identical -- in both cases the compiler used the reciprocal trick to transform the modulo into an imul+add+cmp, the only practical difference being that the %6 has one extra bit shift.
But note the branch in the first function! The original code uses the && operator, which is short-circuiting, so from the compiler's perspective, perhaps the programmer expects that x % 2 == 0 will usually be false, and the more expensive % 3 check can be skipped most of the time. The "suboptimal" version is potentially quite a bit faster in the best case, but also potentially quite a bit slower in the worst case (since that branch could be mispredicted). There's not really a way for the compiler to know which version is "better" without more context, so deferring to "what the programmer wrote" makes sense.
That being said, I don't know that this is really a case of "the compiler knows best" rather than just not having that kind of optimization implemented. If we write 'x % 6 && x % 3', the compiler pointlessly generates both operations. And GCC generates branchless code for 'is_divisible_by_6', which is just worse than 'is_divisible_by_6_optimal' in all cases.
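For anyone curious, here is a rough sketch of that reciprocal trick, with the constants taken from the assembly above (divisible_by_3 is just my name for it): -1431655765 is 0xAAAAAAAB, the multiplicative inverse of 3 modulo 2^32, and 715827882 is a bias that maps the signed multiples of 3 into one contiguous range.

#include <assert.h>
#include <stdint.h>

/* x % 3 == 0 via the reciprocal trick: for x = 3k, multiplying by the
   inverse of 3 (mod 2^32) wraps around to exactly k, and k + 715827882
   lands in [0, 1431655765) for every representable multiple of 3, and
   for nothing else. */
static int divisible_by_3(int32_t x) {
    uint32_t t = (uint32_t)x * 0xAAAAAAABu + 715827882u;
    return t < 1431655765u;
}

int main(void) {
    for (int32_t x = -1000000; x <= 1000000; x++)
        assert(divisible_by_3(x) == (x % 3 == 0));
    return 0;
}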
Probably not, because a lot of the power of optimizing compilers comes from composing optimizations. Also a lot comes from being able to rule out undefined behavior.
> int bar(int num) { return num / 2; }
>
> Doesn't get optimized to a single shift right, because that won't work if num is negative.
Nit: some might think the reason this doesn't work is that the shift would "move" the sign bit, but arithmetic shift instructions exist for exactly this purpose. They are not enough because shifting gives the wrong kind of division rounding for negative numbers. This can, however, be fixed up by adding 1 to the number when it is negative before shifting (which can be done branch-free: an additional logical shift moves the sign bit to the rightmost position, and an addition applies it).
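To make that concrete, here is roughly the branch-free fixup compilers emit for num / 2 (assuming >> on a negative int is an arithmetic shift, which is implementation-defined in C but true of mainstream compilers):

/* Round-toward-zero division by 2 without a branch: a logical shift
   extracts the sign bit, and adding it applies the +1 fixup to
   negative inputs before the arithmetic shift. */
int div2(int num) {
    int fixup = (int)((unsigned)num >> 31); /* 1 if num < 0, else 0 */
    return (num + fixup) >> 1;              /* arithmetic shift right */
}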
Good point. I guess there are more cases than just this one where I'd like to be able to tell the compiler I don't care about rounding behaviour and would prefer the fastest code. Like -ffast-math but for integer operations. I don't think that exists. I wonder why.
I remember reading (although I can't find it now) a great analysis of all the optimizations that JavaScript compilers _can't_ do because of the existence of the "eval" function.
The extra fun thing about this is that eval has different semantics if it's assigned to a different name, in order to allow JavaScript implementations to apply extra optimizations to code that doesn't call a function literally named "eval": https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
The compiler doesn't know the implementation of strlen; it only has its header. At runtime it might be different than at compile time (e.g. LD_PRELOAD=...). For this to be optimized you need link-time optimization.
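To illustrate the LD_PRELOAD point, a hypothetical interposer (file names made up, implementation intentionally plain):

/* mystrlen.c - build: cc -shared -fPIC -o mystrlen.so mystrlen.c
   run:   LD_PRELOAD=./mystrlen.so ./a.out
   The strlen executed at runtime is now this one, not libc's. */
#include <stddef.h>

size_t strlen(const char *s) {
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n; /* a non-conformant replacement could return anything */
}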
Sadness. Tons of functions from the standard library are special-cased by the compiler. The compiler can elide malloc calls if it can prove it doesn't need them, even though strictly speaking malloc has side effects by changing the heap state. Just not useful side effects.
memcpy will get transformed and inlined for small copies all the time.
What happens when you change random functions in your C library? The C standard library and compiler are not independent; both make up a C implementation, whose behaviour is described by the C standard.
Yes, though it's worth stating that it's a little more nuanced than that, since (for historical, path-dependent reasons) the compiler and libc are often independent projects (and libc often includes a bunch of other stuff beyond what the standard/compiler need).
This is the case, for example, on macOS, FreeBSD, and Linux.
You are right, it depends on whether you write C (from the standard) or a specific dialect from your vendor (in practice, everybody does the latter). In that case, you need to know the rules of the compiler. But to allow optimization, these are often similar, so the compiler assumes the functions have the behaviour of the implementation it is tailored against.
> Cute username, BTW.
Thanks, I was too lazy to think of a real name, so this is the timestamp at which I created the account.
The most common reason is to do optimizations such as replacing strlen("hello") with 5 or open-coding strlen (or, more commonly, memcpy or memcmp). If you're linking with a non-conformant strlen (or memcpy or whatever) the usual thing that happens is that you get standards-compliant behavior when the compiler optimizes away the call, but you get the non-conformant behavior you presumably wanted when the compiler compiles a call to your non-conformant function.
But the orthodox answer to such questions is that demons fly out of your nose.
It does. The meanings of certain functions are prescribed by the C standard and the compiler is allowed to expect them to have certain implementations. It can replace them with intrinsics or even remove them entirely. It is of course different for a freestanding implementation.
> I've added a pointless strlen(). It seems like no compiler is clever enough to remove it.
For that you could at least argue that if libc's strlen is faster than strcmp, the early length check improves performance when the programmer expects the function to usually be called with a short input.
That said, changing it to `if (strlen(s) == 5) return 0;` it still doesn't get optimized (https://godbolt.org/z/7feWWjhfo), even though the entire function is completely equivalent to just `return 0;`.
You've misrepresented the situation. Turn up the optimiser to `/O2` and MSVC returns 5 directly, too.
> This function returns 1 if s is "hello". 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it.
It's funny how sometimes operating at a higher level of abstraction allows the compiler to optimise the code better: https://godbolt.org/z/EYP5764Mv
In this, the string literal "hello" is lowered not merely into a static string, but a handful of integral immediates that are directly inline in the assembly, no label-dereferencing required, and the 'is equal to "hello"' test is cast as the result of some sign extends and a bitwise-xor.
Of course, one could argue that std::string_view::size() is statically available, but then my counter-argument is that C's zero-terminated strings are a massive pessimisation (which is why the compiler couldn't 'see' what we humans can), and should always be avoided.
`s[0] == 'h'` isn't sufficient to guarantee that `s[3]` can be accessed without a segfault, so the compiler is not allowed to perform this optimization.
If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen: https://godbolt.org/z/KjdT16Kfb
(also note you got the endianness wrong in your hand-optimized version)
> If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen
But then you're accessing four elements of a string that could have a strlen of less than 3. If the strlen is 1 then the short circuit case saves you because s[1] will be '\0' instead of 'e' and then you don't access elements past the end of the string. The "optimized" version is UB for short strings.
Yes, so that's why the compiler can't and doesn't emit the optimized version if you write the short circuited version - because it behaves differently for short strings.
This is fantastic, thanks! This is the approach I use in httpdito to detect the CRLFCRLF that terminates an HTTP/1.0 GET request, but I'm doing it in assembly.
If you want to tell the compiler not to worry about the possible buffer overrun then you can try `int foo(char const s[static 4])`. Or use `&` instead of `&&` to ensure that there is no short-circuiting, e.g. `if ((s[0] == 'h') & (s[1] == 'e') & (s[2] == 'l') & (s[3] == 'l'))` Either way, this then compiles down to a single 32-bit comparison.
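Spelled out, a sketch of both variants (function names are mine; note that [static 4] is C99, not C++):

/* With [static 4] the caller promises s points to at least 4 valid
   chars, so the compiler may fuse the four comparisons into a single
   32-bit load and compare. */
int starts_with_hell_static(char const s[static 4]) {
    return s[0] == 'h' && s[1] == 'e' && s[2] == 'l' && s[3] == 'l';
}

/* With & instead of && every element is read unconditionally, so the
   same fusion is legal even without the [static 4] promise. */
int starts_with_hell_and(char const *s) {
    return (s[0] == 'h') & (s[1] == 'e') & (s[2] == 'l') & (s[3] == 'l');
}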
Interestingly, it is comparing against a different 32-bit value than `bar` does. I think this is because you accidentally got the order backwards in `bar`.
The code in `bar` is probably not a good idea on targets that don't like unaligned loads.
That's because the 1 instruction variant may read past the end of an array. Let's say s is a single null byte at 0x2000fff, for example (and that memory is only mapped through 0x2001000); the function as written is fine, but the optimized version may page fault.
Since the optimiser is allowed to assume you're not invoking UB, and strlen of null is UB, I don't believe that it would consider that case when optimising this function.
The notion that undefined behavior means the compiler is free to replace it with anything, up to and including "launch nuclear missiles", is just nuts.
If I program it to cause a null pointer seg fault, I expect a null pointer seg fault. If I program it to cause a twos complement overflow, I want a twos complement overflow.
Yeah, I feel the same way. It's refreshing to hear that that's not just because I'm insane. I think C compiler teams are sort of forced into this stupid shit because they don't have new CPU architectures to port to anymore, so, unless they want to go find new jobs, they're forced to waste their time and everyone else's by "improving" the compilers by increasing performance in riskier and riskier ways.