Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: use clang to build release binary. #4138

Closed
wants to merge 2 commits into from

Conversation

PiotrSikora
Copy link
Contributor

Loading 1000 static listeners in a binary built with gcc takes ~51s,
but only ~1.5s (34x faster) in a binary built with clang.

Risk Level: Low
Testing: Manual
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Piotr Sikora piotrsikora@google.com

Loading 1000 static listeners in a binary built with gcc takes ~51s,
but only ~1.5s (34x faster) in a binary built with clang.

*Risk Level*: Low
*Testing*: Manual
*Docs Changes*: n/a
*Release Notes*: n/a

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@PiotrSikora
Copy link
Contributor Author

cc @lizan @htuch @mattklein123

@htuch
Copy link
Member

htuch commented Aug 14, 2018

I'm not opposed to this, just wondering if there's any due diligence that's feasible to ensure we're not regressing elsewhere. Is it commonly accepted that modern Clang now produces more performance code than gcc? I know there were gcc holdouts for a long time because of the reverse, but times changes.

@lizan
Copy link
Member

lizan commented Aug 14, 2018

can you add a line into Release Notes?

@mattklein123
Copy link
Member

mattklein123 commented Aug 14, 2018 via email

@mattklein123
Copy link
Member

mattklein123 commented Aug 14, 2018 via email

@jmarantz
Copy link
Contributor

Quick sanity check: this is for a "-c opt" build?

@PiotrSikora
Copy link
Contributor Author

@htuch / @mattklein123 the source of GCC slowdown is the construction of the LcTrie, and #4117 also fixes the issue, bringing loading time to 0.6s for both compilers, so I suspect that there is a difference in how std::vector::resize() is handled. @lizan has some other guesses, though.

@jmarantz yes.

@htuch
Copy link
Member

htuch commented Aug 14, 2018

Wouldn't std::vector be a libc++ vs. libstdc++ issue? But.. I think we use libstdc++ on Linux for both compilers, so that's probably not right.

@mattklein123
Copy link
Member

I think we use libstdc++ on Linux for both compilers, so that's probably not right.

Yeah pretty positive this is true.

@mattklein123
Copy link
Member

Anyway, IMO we should close this and have a larger conversation about:

  1. Switch to clang
  2. Switch to clang 6 and clang-format 6
  3. Enable flto/whole program optimization

@PiotrSikora
Copy link
Contributor Author

re 1. What's the point of closing this PR if we want to discuss this?
re 2. See #3937, I have clang-6.0 and clang-format-6.0 PRs in my local tree, but only for Ubuntu.

@mattklein123
Copy link
Member

re 1. What's the point of closing this PR if we want to discuss this?

Because IMO we should open an issue and discuss pros/cons there and an action plan. But it doesn't matter that much if we want to do it in the context of this issue. I do think that there should be some other perf tests run before we make the switch (not sure what they are, potentially we can smoke test a build at Lyft and @brian-pane @derekargueta @rgs1 can take a look at a clang build also)?

Also per @lizan definitely should have a release note on this. I agree that moving to 6.0 and considering flto can happen in a different issue/PR.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@lizan
Copy link
Member

lizan commented Aug 14, 2018

I did a bit more investigation on this, it looks like it is a bug that gcc didn't optimize LcNode to uint32_t. And seems it is fixed in somewhere between gcc 6.3 and 8.1.

Wrote a benchmark and here is the result:

Run on (32 X 2200 MHz CPU s)
CPU Caches:
  L1 Data 32K (x16)
  L1 Instruction 32K (x16)
  L2 Unified 256K (x16)
  L3 Unified 56320K (x1)
-------------------------------------------------------------------------
Benchmark                                  Time           CPU Iterations
-------------------------------------------------------------------------
gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
StructVectorInitialization          24702330 ns   24687014 ns         28
Uint32VectorInitialization           1526238 ns    1526120 ns        456

clang version 5.0.1-2~bpo9+1 (tags/RELEASE_501/final)
StructVectorInitialization            376977 ns     376737 ns       1866
Uint32VectorInitialization            369947 ns     369918 ns       1917

gcc-8 (Ubuntu 8.1.0-5ubuntu1~16.04) 8.1.0
StructVectorInitialization            965249 ns     965177 ns        724
Uint32VectorInitialization           1523338 ns    1522336 ns        462

#4117 brings this to unnoticeable difference by removing the fixed the minimum 2000000 allocation. Perhaps a complicated listener with many filter chains will see the difference?

@PiotrSikora
Copy link
Contributor Author

@lizan nice find, thanks! Since you already have the benchmark to reproduce this, what happens if you change the definition of LcNode to:

struct LcNode {
    unsigned int branch_ : 5;
    unsigned int skip_ : 7;
    unsigned int address_ : 20;
};

Does it it fix the issue with older versions of GCC?

@PiotrSikora
Copy link
Contributor Author

Nevermind, I've missed the linked benchmark - using unsigned int doesn't fix the issue.

@mattklein123
Copy link
Member

@PiotrSikora we discussed in the community meeting and we would like to close this issue for now. Can you a) prep a PR to upgrade us to clang 6 (not controversial) and b) open an issue where we can discuss changing the default build to clang and a few other things like stack guards and flto? Thank you!

@PiotrSikora
Copy link
Contributor Author

Done (#4157, with rest to follow), and done (#4158, #4159).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants