
The examples are fun, but rather than yet another article saying how amazing optimizing compilers are (they are, I already know), I'd probably benefit more from an article explaining when obvious optimizations are missed and what to do about it.

Some boring examples I've just thought of...

eg 1:

    int bar(int num) { return num / 2; }
Doesn't get optimized to a single shift right, because that won't work if num is negative. In this case we can change the ints to unsigneds to tell the compiler we know the number isn't negative. But it isn't always easy to express to the compiler everything you know about your data and use case. There is an art in knowing what kinds of things you need to tell the compiler in order to unlock optimizations.
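
For example, a minimal sketch of the unsigned variant, which on typical targets does compile down to a single logical shift:

    unsigned bar_unsigned(unsigned num) { return num / 2; }  /* typically just a shr */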

eg 2:

    int foo(void) { return strlen("hello"); }
We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6

eg 3:

    int foo(char const *s) {
      if (strlen(s) < 3) return 0;
      if (strcmp(s, "hello") == 0)
        return 1;
      return 0;
    }
This function returns 1 if s is "hello". 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it. https://godbolt.org/z/Koj65eo5K. I can think of many reasons the compiler isn't able to spot this.


> We all know that strlen will return 5, but some compilers don't: https://godbolt.org/z/M7x5qraE6

I feel like it is unfair to blame the compiler when you've explicitly asked for `/O1`. If you change this to `/O2` or `/Ox` then MSVC will optimize this into a constant 5, proving that it does "know" that strlen will return 5 in this case.


Fair point. It doesn't do the optimization if you ask to optimize for size '/Os' either.


Yeah, this one as well:

  bool is_divisible_by_6(int x) {
      return x % 2 == 0 && x % 3 == 0;
  }

  bool is_divisible_by_6_optimal(int x) {
      return x % 6 == 0;
  }
Mathematically x % 2 == 0 && x % 3 == 0 is exactly the same as x % 6 == 0 for all C/C++ int values but the compiler doesn't see them as identical, and produces less optimal code for is_divisible_by_6 than for is_divisible_by_6_optimal.


Mhm, this is one of those cases where I'd prefer a benchmark to be sure. Checking % 2 is very cheap, really just a single bit test, and I can also imagine some CPUs having a special code path for % 3. In practice I would not be surprised if the two-check version is actually faster than the % 6. I'm on mobile at the moment, so I'm not able to verify.
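
If someone wants to check, an untested harness sketch would be something like this (caveats: the input pattern is perfectly predictable, and the compiler may vectorize or otherwise transform the loops):

    #include <stdio.h>
    #include <time.h>

    static int check_split(int x)    { return x % 2 == 0 && x % 3 == 0; }
    static int check_combined(int x) { return x % 6 == 0; }

    int main(void) {
        const int n = 100000000;
        long hits = 0;
        clock_t t0 = clock();
        for (int i = 0; i < n; i++) hits += check_split(i);
        clock_t t1 = clock();
        for (int i = 0; i < n; i++) hits += check_combined(i);
        clock_t t2 = clock();
        printf("split: %.3fs  combined: %.3fs  (hits=%ld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, hits);
    }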


But if % 2 && % 3 is better, then isn't there still a missed optimization in this example?


Let's throw this into godbolt: https://clang.godbolt.org/z/qW3qx13qT

    is_divisible_by_6(int):
        test    dil, 1
        jne     .LBB0_1
        imul    eax, edi, -1431655765
        add     eax, 715827882
        cmp     eax, 1431655765
        setb    al
        ret
    .LBB0_1:
        xor     eax, eax
        ret

    is_divisible_by_6_optimal(int):
        imul    eax, edi, -1431655765
        add     eax, 715827882
        ror     eax
        cmp     eax, 715827883
        setb    al
        ret
By themselves, the mod 6 and mod 3 operations are almost identical -- in both cases the compiler used the reciprocal trick to transform the modulo into an imul+add+cmp, the only practical difference being that the %6 has one extra bit shift.
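
In C, the % 3 == 0 check the compiler emitted corresponds roughly to this hand transcription of the imul/add/cmp sequence above (a sketch, assuming 32-bit int):

    #include <stdbool.h>
    #include <stdint.h>

    /* Branchless "x % 3 == 0" via the modular-inverse trick: multiply by the
       inverse of 3 mod 2^32, add an offset so negative inputs work, then check
       that the result lands in the range that multiples of 3 map to. */
    static bool divisible_by_3(int x) {
        uint32_t t = (uint32_t)x * 0xAAAAAAABu + 0x2AAAAAAAu;
        return t < 0x55555555u;
    }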

But note the branch in the first function! The original code uses the && operator, which is short-circuiting -- so from the compiler's perspective, perhaps the programmer expects that x % 2 == 0 will usually be false, and so we can skip the expensive % 3 most of the time. The "suboptimal" version is potentially quite a bit faster in the best case, but also potentially quite a bit slower in the worst case (since that branch could be mispredicted). There's not really a way for the compiler to know which version is "better" without more context, so deferring to "what the programmer wrote" makes sense.

That being said, I don't know that this is really a case of "the compiler knows best" rather than just not having that kind of optimization implemented. If we write 'x % 6 && x % 3', the compiler pointlessly generates both operations. And GCC generates branchless code for 'is_divisible_by_6', which is just worse than 'is_divisible_by_6_optimal' in all cases.


I also tried this

  bool is_divisible_by_15(int x) {
      return x % 3 == 0 && x % 5 == 0;
  }

  bool is_divisible_by_15_optimal(int x) {
      return x % 15 == 0;
  }
is_divisible_by_15 still has a branch, while is_divisible_by_15_optimal does not

  is_divisible_by_15(int):
        imul    eax, edi, -1431655765
        add     eax, 715827882
        cmp     eax, 1431655764
        jbe     .LBB0_2
        xor     eax, eax
        ret
  .LBB0_2:
        imul    eax, edi, -858993459
        add     eax, 429496729
        cmp     eax, 858993459
        setb    al
        ret

  is_divisible_by_15_optimal(int):
        imul    eax, edi, -286331153
        add     eax, 143165576
        cmp     eax, 286331153
        setb    al
        ret


Nice.

Is the best way to think of optimizing compilers, "I wonder if someone hand wrote a rule for the optimizer that fits this case"?


Probably not, because a lot of the power of optimizing compilers comes from composing optimizations. Also a lot comes from being able to rule out undefined behavior.
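
A toy illustration of that composition (a sketch; exact output varies by compiler, but this is the kind of chain involved -- inlining exposes constants, then constant propagation and folding finish the job):

    static int square(int x) { return x * x; }

    /* After inlining square(), constant propagation and folding typically
       reduce this whole function to "return 25", with no multiplies at run time. */
    int answer(void) { return square(3) + square(4); }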


> int bar(int num) { return num / 2; }
>
> Doesn't get optimized to a single shift right, because that won't work if num is negative.

Nit: some might think the reason this doesn't work is because the shift would "move" the sign bit, but actually arithmetic shifting instructions exist for this exact purpose. The reason they are not enough is because shifting provides the wrong kind of division rounding for negative numbers. This can however be fixed up by adding 1 if the number is negative (this can be done with an additional logical shift for moving the sign bit to the rightmost position and an addition).
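
A sketch of that fix-up in C, assuming 32-bit int and an arithmetic right shift for negative values (implementation-defined in C, but what mainstream targets do); this is roughly what compilers emit for num / 2:

    #include <stdint.h>

    /* Divide by 2 rounding toward zero, using only shifts and an add. */
    int half(int num) {
        int bias = (int)((uint32_t)num >> 31);  /* 1 if num is negative, else 0 */
        return (num + bias) >> 1;               /* fix up the rounding, then shift */
    }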


Good point. I guess there are more cases than just this one where I'd like to be able to tell the compiler I don't care about rounding behaviour and would prefer the fastest code. Like -ffast-math but for integer operations. I don't think that exists. I wonder why.


Will shift, shift, and add be slower or faster than a divide instruction on machines with a divide instruction?


Most likely not; division instructions generally take as long as 10-20 other arithmetic/logic instructions.


> won't work if num is negative

I remember reading (although I can't find it now) a great analysis of all the optimizations that Javascript compilers _can't_ do because of the existence of the "eval" function.


The extra fun thing about this is that eval has different semantics if it's assigned to a different name, in order to allow JavaScript implementations to apply extra optimizations to code that doesn't call a function literally named "eval": https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

Andy Wingo (of course!) has a good explanation of this: https://wingolog.org/archives/2012/01/12/javascript-eval-con...


Could this perhaps be it? https://janvitek.org/pubs/ecoop11.pdf


A JIT can do any optimization it wants, as long as it can deoptimize if it turns out it was wrong.


You also want to prove that the „optimization“ doesn’t make things slower.


I don't think anyone actually bothers to do that tbh.


The compiler doesn't know the implementation of strlen; it only has its header. At runtime it might be different than at compile time (e.g. LD_PRELOAD=...). For this to be optimized you need link time optimization.


Both clang and gcc do optimize it though - https://godbolt.org/z/cGG9dq756. You need -fno-builtin or similar to get them to not.


No, the compiler may assume that the behavior of standard library functions is standards-conformant.


> No, the compiler may assume that the behavior of standard library functions is standards-conformant.

Why?

What happens if it isn't?


Sadness. Tons of functions from the standard library are special-cased by the compiler. The compiler can elide malloc calls if it can prove it doesn't need them, even though strictly speaking malloc has side effects by changing the heap state. Just not useful side effects.

memcpy will get transformed and inlined for small copies all the time.
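
For example, a sketch of the malloc elision mentioned above (hypothetical function; whether a given compiler fully elides it depends on version and flags):

    #include <stdlib.h>

    /* Nothing about the heap state is observable here, so a compiler that
       elides the allocation can reduce this function to just "return 42". */
    int observe(void) {
        int *p = malloc(sizeof *p);
        if (!p) return 0;
        *p = 42;
        int v = *p;
        free(p);
        return v;
    }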


What happens when you change random functions in your C compiler? The C standard library and compiler are not independent; together they make up a C implementation, whose behaviour is described by the C standard.


Yes, though it's worth stating that it's a little more nuanced than that, since (for historical, path-dependent reasons) the compiler and libc are often independent projects (and libc often includes a bunch of other stuff beyond what the standard/compiler need).

This is the case, for example, on macOS, FreeBSD, and Linux.

Cute username, BTW.


You are right, it depends on whether you write C (as defined by the standard) or a specific dialect from your vendor (which everybody does in practice). In the latter case, you need to know the rules of your compiler. But to allow optimization, these rules are usually similar, in that the compiler assumes the library functions have the behaviour of the implementation the compiler is tailored to.

> Cute username, BTW.

Thanks, I was too lazy to think of a real name, so it's the timestamp at which I created the account.


> What happens if it isn't?

§6.4.2.1: "If the program defines a reserved identifier [...] the behavior is undefined."


The most common reason is to do optimizations such as replacing strlen("hello") with 5 or open-coding strlen (or, more commonly, memcpy or memcmp). If you're linking with a non-conformant strlen (or memcpy or whatever) the usual thing that happens is that you get standards-compliant behavior when the compiler optimizes away the call, but you get the non-conformant behavior you presumably wanted when the compiler compiles a call to your non-conformant function.

But the orthodox answer to such questions is that demons fly out of your nose.
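
For instance, a sketch of the kind of open-coding described above (hypothetical function; exact codegen depends on compiler, target, and flags):

    #include <string.h>

    /* Small fixed-size memcmp calls like this are typically open-coded as a
       single 4-byte load and compare rather than an actual call into libc. */
    int is_magic(const char *p) {
        return memcmp(p, "\x7f" "ELF", 4) == 0;
    }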


Because that's what it means to compile a specific dialect of a specific programming language?

If you want a dialect where they aren't allowed to assume that, you would have to make your own.


It does. The meanings of certain functions are prescribed by the C standard, and the compiler is allowed to expect them to have certain implementations. It can replace them with intrinsics or even remove them entirely. It is of course different for a freestanding implementation.


Hmmm, really? Switching compiler seems sufficient: https://godbolt.org/z/xnevov5d7

BTW, the case of it not optimizing was MSVC targeting Windows (which doesn't support LD_PRELOAD, but maybe has something similar?).


> I've added a pointless strlen(). It seems like no compiler is clever enough to remove it.

For that one you could at least argue that, if libc's strlen is faster than strcmp, doing the length check first improves performance when the programmer expects the function to usually be called with short inputs.

That said, changing it to `if (strlen(s) == 5) return 0;` it still doesn't get optimized (https://godbolt.org/z/7feWWjhfo), even though the entire function is completely equivalent to just `return 0;`.


> but some compilers don't: https://godbolt.org/z/M7x5qraE6

You've misrepresented the situation. Turn up the optimiser to `/O2` and MSVC returns 5 directly, too.

> This function returns 1 if s is "hello". 0 otherwise. I've added a pointless strlen(). It seems like no compiler is clever enough to remove it.

It's funny how sometimes operating at a higher level of abstraction allows the compiler to optimise the code better: https://godbolt.org/z/EYP5764Mv

In this, the string literal "hello" is lowered not merely into a static string, but a handful of integral immediates that are directly inline in the assembly, no label-dereferencing required, and the 'is equal to "hello"' test is cast as the result of some sign extends and a bitwise-xor.

Of course, one could argue that std::string_view::size() is statically available, but then my counter-argument is that C's zero-terminated strings are a massive pessimisation (which is why the compiler couldn't 'see' what we humans can), and should always be avoided.


eg 4:

    int foo(char const *s) {
      if (s[0] == 'h' && s[1] == 'e' && s[2] == 'l' && s[3] == 'l')
        return 1;
      return 0;
    }
The compiler outputs 4 cmp instructions here, even though I'd have thought 1 was sufficient. https://godbolt.org/z/hqMnbrnKe


`s[0] == 'h'` isn't sufficient to guarantee that `s[3]` can be accessed without a segfault, so the compiler is not allowed to perform this optimization.

If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen: https://godbolt.org/z/KjdT16Kfb

(also note you got the endianness wrong in your hand-optimized version)


> If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen

But then you're accessing four elements of a string that could have a strlen of less than 3. If the strlen is 1 then the short circuit case saves you because s[1] will be '\0' instead of 'e' and then you don't access elements past the end of the string. The "optimized" version is UB for short strings.


Yes, so that's why the compiler can't and doesn't emit the optimized version if you write the short circuited version - because it behaves differently for short strings.


UB doesn't exist in the processor (it does, but not here). If the compiler knows the pointer is aligned it can do the transformation.


For the compiler to know the pointer is aligned it would have to actually be aligned and there is no guarantee that it is.


This is fantastic, thanks! This is the approach I use in httpdito to detect the CRLFCRLF that terminates an HTTP/1.0 GET request, but I'm doing it in assembly.


Ooo, I'd never thought of using & like that. Interesting.

> (also note you got the endianness wrong in your hand-optimized version)

Doh :-)


Matt Godbolt's talk on ray tracers shows how effective that change can be. I think it was that talk, anyway.

https://www.youtube.com/watch?v=HG6c4Kwbv4I


good ol' short circuiting


If you want to tell the compiler not to worry about the possible buffer overrun then you can try `int foo(char const s[static 4])`. Or use `&` instead of `&&` to ensure that there is no short-circuiting, e.g. `if ((s[0] == 'h') & (s[1] == 'e') & (s[2] == 'l') & (s[3] == 'l'))`. Either way, this then compiles down to a single 32-bit comparison.
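
Spelled out, the two suggestions look something like this (hypothetical names; the `static 4` form is C-only, and whether each collapses to a single comparison depends on compiler and target):

    /* Variant 1 (C only): "static 4" promises at least 4 readable chars,
       so the compiler may load all four bytes unconditionally. */
    int foo_static(char const s[static 4]) {
        if (s[0] == 'h' && s[1] == 'e' && s[2] == 'l' && s[3] == 'l')
            return 1;
        return 0;
    }

    /* Variant 2: non-short-circuiting & forces all four accesses regardless. */
    int foo_and(char const *s) {
        if ((s[0] == 'h') & (s[1] == 'e') & (s[2] == 'l') & (s[3] == 'l'))
            return 1;
        return 0;
    }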

Interestingly, it is comparing against a different 32-bit value than `bar` does. I think this is because you accidentally got the order backwards in `bar`.

The code in `bar` is probably not a good idea on targets that don't like unaligned loads.


That's because the 1 instruction variant may read past the end of an array. Let's say s is a single null byte at 0x2000fff, for example (and that memory is only mapped through 0x2001000); the function as written is fine, but the optimized version may page fault.


Ah, yes, good point. I think this is a nice example of "I didn't notice I needed to tell the compiler a thing I know so it can optimize".


`s` may be null, and so the strlen may seg fault.


But that's undefined behavior, so the compiler is free to ignore that possibility.


> so the compiler is free to ignore that possibility

And that's what is wrong. This is the most unfriendly behavior towards the programmer.


Since the optimiser is allowed to assume you're not invoking UB, and strlen of null is UB, I don't believe that it would consider that case when optimising this function.


I understand that, but I don't agree that such optimizer behavior is worth it and I won't put it in my compilers.


I appreciate that greatly.


The notion that undefined behavior means the compiler is free to replace it with anything, up to and including "launch nuclear missiles", is just nuts.

If I program it to cause a null pointer seg fault, I expect a null pointer seg fault. If I program it to cause a two's complement overflow, I want a two's complement overflow.


Yeah, I feel the same way. It's refreshing to hear that that's not just because I'm insane. I think C compiler teams are sort of forced into this stupid shit because they don't have new CPU architectures to port to anymore, so, unless they want to go find new jobs, they're forced to waste their time and everyone else's by "improving" the compilers by increasing performance in riskier and riskier ways.



