# Silent crash or failure when remoting. #9395
Seen again on #9398.
I've often seen Pants output that ~20 tests have succeeded, and then it will silently crash. I believe I have only seen this in the remote execution shards.
I got output for the first time!
### Problem

#9395 occurs within our `grpcio` dependency, which is quite stale. Although #9395 is more likely to be related to changes in our executor (#9071) or to our transitive rust dependencies (#9122), getting on a more recent version of the `grpcio` crate _might_ resolve the issue, or make it easier to report an issue.

### Solution

Bump to `0.5.1` with one patch (tikv/grpc-rs#457) pulled forward from our previous fork.

[ci skip-jvm-tests]  # No JVM changes made.
I believe that I saw this again. EDIT: And again, but this time with:
EDIT: rust-lang/rust#51245 allows for causing this to panic rather than just aborting, which would get us a stacktrace from the attempt to allocate that much memory. It seems to be a strangely consistent size though (~190 MB). Possibly a particular input/output we have that is 190MB...?
I encountered this today with no extra information.
Some whack-a-mole-style debug output turned up some culprits: https://api.travis-ci.com/v3/job/320301504/log.txt. It seems we're producing some very large outputs. Will look this evening.
Waited too long to investigate and those digests appear to be gone from the remote store. Will try again to repro.
The 190MB digest is for a … The next question might be to ask why we would ever be holding multiple copies of it at once (AFAIK, we shouldn't be). Will need to do some local debugging.
I haven't been able to reproduce this locally yet under a profiler, but I've been brewing a hypothesis that we're racing to upload 32 copies of things in some cases when running integration tests. Will confirm that, but if so there are a few tacks:
As a lay observer, #3 sounds the most compelling because it would presumably remove unnecessary work.
Encountered this morning https://api.travis-ci.com/v3/job/332126695/log.txt:
And then it fails.
I imagine this taking the form of a RemoteCasTracker which is basically:
Each time an upload is attempted, we'd check the `RemoteCasTracker`: if it's `ProbablyUploaded` we'll skip it, if it's `Uploading` we'll return a clone of the upload future, and if we get definite information that it isn't uploaded we'll trigger an upload and set it to the `Uploading` state.
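A minimal sketch of what such a tracker could look like, assuming it is keyed by digest and guarded by a mutex. The `Digest` alias, method names, and synchronous shape here are hypothetical; a real async version would store a cloneable upload future in the `Uploading` variant rather than a bare marker, and the tracker would be shared via an `Arc` across upload tasks:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::Mutex;

// Placeholder for the store's real content-digest type.
type Digest = String;

enum UploadState {
    // We have observed (or completed) an upload of this digest: skip it.
    ProbablyUploaded,
    // An upload is in flight; in an async version this would hold a shared future.
    Uploading,
}

#[derive(Default)]
struct RemoteCasTracker {
    entries: Mutex<HashMap<Digest, UploadState>>,
}

impl RemoteCasTracker {
    /// Returns true if the caller should start the (single) upload for this digest.
    fn should_upload(&self, digest: &Digest) -> bool {
        let mut entries = self.entries.lock().unwrap();
        match entries.entry(digest.clone()) {
            Entry::Occupied(e) => match e.get() {
                // Already in the remote CAS as far as we know: skip entirely.
                UploadState::ProbablyUploaded => false,
                // An upload is already in flight: reuse it rather than racing.
                UploadState::Uploading => false,
            },
            // First caller for this digest: record the in-flight upload and start it.
            Entry::Vacant(v) => {
                v.insert(UploadState::Uploading);
                true
            }
        }
    }

    /// Called after a successful upload so later requests can skip it.
    fn mark_uploaded(&self, digest: &Digest) {
        let mut entries = self.entries.lock().unwrap();
        entries.insert(digest.clone(), UploadState::ProbablyUploaded);
    }
}

fn main() {
    let tracker = RemoteCasTracker::default();
    let digest = "example-digest".to_string();
    assert!(tracker.should_upload(&digest)); // first attempt triggers the upload
    assert!(!tracker.should_upload(&digest)); // racing attempts skip or reuse it
    tracker.mark_uploaded(&digest);
    assert!(!tracker.should_upload(&digest)); // later attempts skip entirely
}
```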
### Problem

The LMDB API allows for direct references into its storage, which will always be MMAP'ed. But because it has a blocking API, we interact with it on threads that have been spawned for that purpose. To allow for interacting with the database in a way that avoids copying data into userspace, the `load_bytes_with` method takes a function that will be called on the blocking thread with a reference directly to that memory. This allows for minimizing copies of data, and avoiding holding full copies of database entries in memory.

But at some point a while back (before async/await made dealing with references easy again), `load_bytes_with` started passing a `Bytes` instance to its callback. And constructing a `Bytes` instance with anything other than a static memory reference (which this isn't) requires copying into the `Bytes` instance to give it ownership. This meant that we weren't actually taking advantage of the odd shape of `load_bytes_with`.

### Solution

Switch the local `load_bytes_with` back to providing a reference into the database's owned memory, and document the reason for its shape. In separate commits, port the `serverset` crate and code that touches it to async/await. Finally, adjust the remote `store_bytes` to accept a reference to avoid potential copies there as well.

### Result

Less memory usage, and less copying of data. In particular: cases that read from the local database in order to copy elsewhere (such as `materialize_directory`, `ensure_remote_has_recursive`, and BRFS `read`) will now copy data directly out of the database and into the destination. `ensure_remote_has_recursive` should now hold onto only [one chunk's worth](https://github.com/pantsbuild/pants/blob/4f45871814c15c0f41b335fc80f98780b9b38d92/src/python/pants/option/global_options.py#L752-L758) of data at a time per file it is uploading, which might be sufficient to fix #9395 (although the other changes mentioned there will likely still be useful).
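To illustrate the callback shape being restored here, a minimal self-contained sketch follows. It is not the actual pants `Store` code: the `Vec<u8>` stands in for LMDB's mmap'ed memory and a spawned thread stands in for the blocking pool. The key point is the same, though: the closure receives a borrowed slice, so callers that only stream the bytes elsewhere never force a copy into an owned buffer.

```rust
use std::sync::Arc;
use std::thread;

// Toy stand-in for the local store: `backing` plays the role of LMDB's mmap'ed
// memory, which in the real store is only valid inside a read transaction that
// runs on a dedicated blocking thread.
struct LocalStore {
    backing: Arc<Vec<u8>>,
}

impl LocalStore {
    // The caller's closure runs on a "blocking" thread and receives a borrowed
    // slice, so reads that just stream the bytes somewhere else (materializing a
    // file, uploading a chunk) never copy the entry into an owned `Bytes`/`Vec`.
    fn load_bytes_with<T, F>(&self, f: F) -> T
    where
        F: FnOnce(&[u8]) -> T + Send + 'static,
        T: Send + 'static,
    {
        let backing = Arc::clone(&self.backing);
        thread::spawn(move || f(backing.as_slice()))
            .join()
            .expect("blocking worker panicked")
    }
}

fn main() {
    let store = LocalStore {
        backing: Arc::new(vec![0u8; 1024]),
    };
    // Derive a value from the stored bytes without ever copying them out.
    let len = store.load_bytes_with(|bytes| bytes.len());
    println!("loaded {} bytes without copying", len);
}
```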
No repros of this since #9793. We'll likely still want to do one or both of the other things from #9395 (comment) at some point, but will wait a bit longer before closing this and opening a followup.
Might also be able to get away with an async …
No repros in a few weeks! I've moved the optimization ideas to #9960, and am resolving this as fixed.
The subject and labels may be misleading - there is not much to go on here. For posterity in case this becomes frequent though, details follow.
Seen on a master CI burn in the 'Integration tests - V2 (Python 3.6)' shard:
The combination of `travis-wait-enhanced FTL Non-zero exit code for ./build-support/bin/ci.py` and no output from the test command after having scheduled all tests is the interesting bit. It seems like a Pants crash with no details at all.