-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add timeouts to cache lookups and handle server eofs #79
Conversation
Could you write something to stderr/somewhere else visible on buildbot when we get unexpected EOF errors, so we can have some monitoring over them? |
@arielb1 yeah unfortunately architecturally sccache is slightly difficult to debug since it primarily runs as a daemon (without stderr/stdout). Note though that it supports redirection of stderr in the daemon which we're leveraging and printing. (those logs are how I discovered this issue) |
d990325
to
baf9610
Compare
I've built a local copy with mozilla/sccache#79 and mozilla/sccache#78. Let's see if that helps rust-lang#40240!
travis: Update sccache binary I've built a local copy with mozilla/sccache#79 and mozilla/sccache#78. Let's see if that helps #40240!
travis: Update sccache binary I've built a local copy with mozilla/sccache#79 and mozilla/sccache#78. Let's see if that helps #40240!
@luser FWIW I've deployed this to rust-lang/rust and it seems to have at least erased that particular failure mode! |
@alexcrichton That's great to hear! I have been trying to get my Rust PR landable, I will finish that off so we can get everything else rebased and landed. |
src/commands.rs
Outdated
// get a response then we just forge ahead locally and run the | ||
// compilation ourselves. | ||
Err(ProtobufError::IoError(ref e)) | ||
if e.kind() == io::ErrorKind::UnexpectedEof => {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put a log message here so it's easy to figure out when this happens? Actually I think even straight printing something to stderr would be OK in this case, it feels like a bug if we hit this in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine, just a few little things that would be good. Feel free to rebase and land this.
src/compiler/compiler.rs
Outdated
@@ -355,6 +360,20 @@ impl Compiler { | |||
storage.get(&key) | |||
}; | |||
|
|||
// Wait at most a minute for the cache to respond before we forge | |||
// ahead ourselves with a compilation. | |||
let timeout = Duration::new(60, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a followup, it might be nice to have a way to configure this so we can tweak it in practice.
@@ -575,6 +576,9 @@ impl<C> SccacheService<C> | |||
stats.cache_misses += 1; | |||
stats.forced_recaches += 1; | |||
} | |||
MissType::TimedOut => { | |||
stats.cache_misses += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to add a stat for this as well so we can track it?
I believe this is possible to hit if the compilation takes quite a long time and times out the server itself. The server may shut down due to being idle while we're still connected and cause us to receive an `UnexpectedEof`.
This should help ensure that we don't wait *too* long for the cache to respond (for example on an excessively slow network) and time out the server unnecessarily.
baf9610
to
b91b81a
Compare
We've unfortunately been having a lot of trouble over at rust-lang/rust#40240 with sccache causing spurious failures. A number of logs are linked from that issue and to try to diagnose these issues I've enabled sccache logging which we cat to the Travis logs on failures.
Today we got a new failure whose failing build pointed to:
Upon inspection of the sccache server debug logs (at the end of the build log) I noticed the failing object was
ELFDumper.cpp.o
, we attempted to fetch this object from the cache (I think), and no other messages aboutELFDumper.cpp.o
(success or not) happened on the server. My theory for what happened looks like:To handle this failure mode this PR makes a few modifications
UnexpectedEof
the same way it handles "unhandled compilation", it just compiles the object locally. This should protect against this in the future and other spurious races like this which are generally benign.