Silent crash or failure when remoting. #9395

Closed
jsirois opened this issue Mar 25, 2020 · 14 comments · Fixed by #9793

@jsirois
Contributor

jsirois commented Mar 25, 2020

The subject and labels may be misleading; there is not much to go on here. For posterity, in case this becomes frequent, details follow.

Seen on a master CI burn in the 'Integration tests - V2 (Python 3.6)' shard:

$ travis-wait-enhanced --timeout 65m --interval 9m -- ./build-support/bin/ci.py --integration-tests-v2 --remote-execution-enabled --python-version 3.6
/home/travis/.pex/installed_wheels/b310034b9aeb62e68d8f271e153def66197e9d0d/toml-0.10.0-py2.py3-none-any.whl/toml/decoder.py:47: DeprecationWarning: invalid escape sequence \.
  TIME_RE = re.compile("([0-9]{2}):([0-9]{2}):([0-9]{2})(\.([0-9]{3,6}))?")
17:53:30 [WARN] Unmatched glob from tests/python/pants_test/backend/jvm/tasks:ivy_resolve_integration's `sources` field: "tests/python/pants_test/backend/jvm/tasks/test_ivy_resolve_integration.py"
17:53:38 [INFO] Starting tests: contrib/buildrefactor/tests/python/pants_test/contrib/buildrefactor:buildozer_integration
...
17:53:38 [INFO] Starting tests: tests/python/pants_test/goal:run_tracker_integration
[=== 00:00 CI BEGINS ===]
[=== 00:00 Setting interpreter constraints to ['CPython==3.6.*'] ===]
IntegrationTestsV2Remote
[=== 00:36 Running integration tests via V2 remote strategy. ===]
Integration test failure (V2 remote)
travis-wait-enhanced FTL Non-zero exit code for ./build-support/bin/ci.py error="exit status 1"
The command "travis-wait-enhanced --timeout 65m --interval 9m -- ./build-support/bin/ci.py --integration-tests-v2 --remote-execution-enabled --python-version 3.6" exited with 1.

The combination of `travis-wait-enhanced FTL Non-zero exit code for ./build-support/bin/ci.py` and no output from the test command after having scheduled all tests is the interesting bit. It looks like a Pants crash with no details at all.

@jsirois
Contributor Author

jsirois commented Mar 25, 2020

Seen again on #9398.

@Eric-Arellano
Contributor

I've often seen Pants report that ~20 tests have succeeded, and then it silently crashes. I believe I have only seen this in the remote execution shards.

@Eric-Arellano
Contributor

I got output for the first time!

06:23:09 [INFO] Tests succeeded: tests/python/pants_test/backend/jvm/tasks/reports
06:23:11 [INFO] Tests succeeded: contrib/cpp/tests/python/pants_test/contrib/cpp:cpp_toolchain
E0327 06:23:32.125356601 6544 chttp2_transport.c:754] client stream 1 still included in list 1

@stuhood self-assigned this Apr 4, 2020
stuhood added a commit that referenced this issue Apr 5, 2020
### Problem

#9395 occurs within our `grpcio` dependency, which is quite stale. Although #9395 is more likely to be related to changes in our executor (#9071) or to our transitive Rust dependencies (#9122), getting onto a more recent version of the `grpcio` crate _might_ resolve the issue, or at least make it easier to report an issue upstream.

### Solution

Bump to `0.5.1` with one patch (tikv/grpc-rs#457) pulled forward from our previous fork. 

[ci skip-jvm-tests]  # No JVM changes made.
@stuhood
Member

stuhood commented Apr 11, 2020

I believe that I saw this again.

EDIT: And again, but this time with:

memory allocation of 190881989 bytes failed

EDIT: rust-lang/rust#51245 allows for causing this to panic rather than just aborting, which would get us a stack trace for whatever is attempting to allocate that much. It seems to be a strangely consistent size though (~190 MB). Possibly a particular input/output we have that is ~190 MB...?

@Eric-Arellano
Contributor

I encountered this today with no extra information.

@stuhood
Member

stuhood commented Apr 17, 2020

Some whack-a-mole-style debug output turned up some culprits (https://api.travis-ci.com/v3/job/320301504/log.txt). It seems we're producing some very large outputs. Will look this evening.

03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:43 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:43 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)

@stuhood
Member

stuhood commented Apr 21, 2020

I waited too long to investigate, and those digests appear to be gone from the remote store. Will try again to repro.

@stuhood
Member

stuhood commented Apr 21, 2020

The 190 MB digest is for a `native_engine.so`. Not too surprising, in retrospect.

The next question is why we would ever be holding multiple copies of it at once (AFAIK, we shouldn't be). Will need to do some local debugging.

@stuhood
Member

stuhood commented May 12, 2020

I haven't been able to reproduce this locally under a profiler yet, but I've been brewing a hypothesis that we're racing to upload 32 copies of things in some cases when running integration tests. Will confirm that, but if so, there are a few tacks:

  1. Add a generalized async semaphore to throttle how much we hold in memory during these operations (see the sketch after this list)
  2. Avoid holding these in memory while we upload them (they're in LMDB: no need to actually hold them in userspace while uploading)
  3. Add some per-file memoization to de-dupe uploads
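
For reference, option 1 might look roughly like the following sketch (hypothetical names: `upload_one` stands in for loading one blob and writing it to the remote CAS, `Digest` is Pants' `hashing::Digest`, and this is not the actual Pants code):

use std::sync::Arc;
use hashing::Digest; // Pants' (Fingerprint, size) content digest
use tokio::sync::Semaphore;

// Hypothetical stand-in for loading one blob from LMDB and writing it to the remote CAS.
async fn upload_one(_digest: Digest) {}

// Allow at most `max_in_flight` uploads to hold their bytes in memory at once.
async fn upload_all(digests: Vec<Digest>, max_in_flight: usize) {
  let semaphore = Arc::new(Semaphore::new(max_in_flight));
  let uploads = digests.into_iter().map(|digest| {
    let semaphore = semaphore.clone();
    async move {
      // Wait for a slot before touching the blob's bytes.
      let _permit = semaphore.acquire().await.expect("semaphore closed");
      upload_one(digest).await;
      // `_permit` drops here, releasing the slot.
    }
  });
  futures::future::join_all(uploads).await;
}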

@Eric-Arellano
Contributor

As a lay observer, option 3 sounds the most compelling because it would presumably remove unnecessary work.

@Eric-Arellano
Contributor

Eric-Arellano commented May 13, 2020

Encountered this morning (https://api.travis-ci.com/v3/job/332126695/log.txt):

18:56:46:483 [INFO] Starting tests: tests/python/pants_test/backend/project_info/tasks:export_dep_as_jar_integration
memory allocation of 194079374 bytes failed

And then it fails.

@illicitonion
Contributor

> 3. Add some per-file memoization to de-dupe uploads

I imagine this taking the form of a `RemoteCasTracker`, which is basically:

use futures::future::{BoxFuture, Shared};
use hashing::Digest; // Pants' (Fingerprint, size) content digest
use lru::LruCache;

enum RemoteState {
  // We believe a previous upload of this digest completed.
  ProbablyUploaded,
  // An upload is in flight; callers can await a clone of this shared future.
  // (`Shared<BoxFuture<'static, ()>>` is one concrete way to spell a cloneable future.)
  Uploading(Shared<BoxFuture<'static, ()>>),
}

struct RemoteCasTracker {
  state: LruCache<Digest, RemoteState>,
}

Each time an upload is attempted, we'd check the `RemoteCasTracker`: if the digest is `ProbablyUploaded` we'd skip it; if it's `Uploading` we'd return a clone of the upload future; and if we get definite information that it isn't uploaded, we'd trigger an upload and set it to the `Uploading` state.
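
A minimal sketch of that lookup, assuming the types above plus a hypothetical `start_upload` helper standing in for the real CAS write:

use futures::FutureExt;

// Hypothetical stand-in for writing one blob to the remote CAS.
async fn start_upload(_digest: Digest) {}

impl RemoteCasTracker {
  // Returns a future that resolves once the digest is (probably) in the remote CAS.
  fn ensure_uploaded(&mut self, digest: Digest) -> Shared<BoxFuture<'static, ()>> {
    // Fast path: we already know something about this digest.
    if let Some(state) = self.state.get(&digest) {
      return match state {
        RemoteState::ProbablyUploaded => futures::future::ready(()).boxed().shared(),
        RemoteState::Uploading(f) => f.clone(),
      };
    }
    // Otherwise, start an upload and remember the in-flight future so that concurrent
    // callers share it instead of re-uploading the same bytes.
    let f = start_upload(digest).boxed().shared();
    self.state.put(digest, RemoteState::Uploading(f.clone()));
    f
  }
}

A real implementation would also want to flip the entry to ProbablyUploaded (or evict it) when the shared future completes, and to propagate upload errors rather than silently treating a failed upload as done.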

stuhood added a commit that referenced this issue May 16, 2020
### Problem

The LMDB API allows for direct references into its storage, which will always be MMAP'ed. But because it has a blocking API, we interact with it on threads that have been spawned for that purpose.

To allow for interacting with the database in a way that avoids copying data into userspace, the `load_bytes_with` method takes a function that will be called on the blocking thread with a reference directly to that memory. This allows for minimizing copies of data, and avoiding holding full copies of database entries in memory.

But at some point a while back (before async/await made dealing with references easy again), `load_bytes_with` started passing a `Bytes` instance to its callback. And constructing a `Bytes` instance with anything other than a static memory reference (which this isn't) requires copying into the `Bytes` instance to give it ownership. This meant that we weren't actually taking advantage of the odd shape of `load_bytes_with`.

### Solution

Switch the local `load_bytes_with` back to providing a reference into the database's owned memory, and document the reason for its shape. In separate commits, port the `serverset` crate and code that touches it to async/await. Finally, adjust the remote `store_bytes` to accept a reference to avoid potential copies there as well.

### Result

Less memory usage, and less copying of data. In particular: cases that read from the local database in order to copy elsewhere (such as `materialize_directory`, `ensure_remote_has_recursive`, and BRFS `read`) will now copy data directly out of the database and into the destination.

`ensure_remote_has_recursive` should now hold onto only [one chunk's worth](https://github.com/pantsbuild/pants/blob/4f45871814c15c0f41b335fc80f98780b9b38d92/src/python/pants/option/global_options.py#L752-L758) of data at a time per file it is uploading, which might be sufficient to fix #9395 (although the other changes mentioned there will likely still be useful).
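
To make the two callback shapes concrete, here is a minimal illustrative sketch (not the exact Pants signatures; the real `load_bytes_with` also handles the LMDB read transaction and the spawn onto a dedicated blocking thread):

use bytes::Bytes;

// The shape the code had drifted to: the callback gets an owned `Bytes`, and building
// that `Bytes` from a non-static slice forces a full copy of the value into userspace.
fn load_with_owned<T>(mmap_value: &[u8], f: impl FnOnce(Bytes) -> T) -> T {
  f(Bytes::copy_from_slice(mmap_value))
}

// The shape this change restores: the callback gets a borrowed slice pointing directly
// into the MMAP'ed value, so only the callback decides what (if anything) gets copied,
// e.g. one upload chunk at a time rather than a whole ~190 MB entry.
fn load_with_borrowed<T>(mmap_value: &[u8], f: impl FnOnce(&[u8]) -> T) -> T {
  f(mmap_value)
}
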
@stuhood reopened this May 16, 2020
@stuhood
Member

stuhood commented May 20, 2020

No repros of this since #9793. We'll likely still want to do one or both of the other things from #9395 (comment) at some point, but will wait a bit longer before closing this and opening a followup.

> I imagine this taking the form of a `RemoteCasTracker`, which is basically:

Might also be able to get away with an async Mutex around the result value, which might make it easier to recover from failure.
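
A rough sketch of that alternative, with hypothetical names (`try_upload` stands in for the real CAS write, `Digest` as above). The useful property is that a failed upload leaves the slot empty, so the next caller that takes the lock simply retries, whereas a cached shared future would keep handing out the old result unless it were explicitly evicted:

use std::sync::Arc;
use tokio::sync::Mutex;

// One slot per digest: `Some(())` once an upload has succeeded, `None` otherwise.
type UploadSlot = Arc<Mutex<Option<()>>>;

// Hypothetical stand-in for the real CAS write.
async fn try_upload(_digest: Digest) -> Result<(), String> {
  Ok(())
}

async fn ensure_uploaded(slot: UploadSlot, digest: Digest) -> Result<(), String> {
  // Only one caller uploads at a time; the rest queue on the async lock.
  let mut done = slot.lock().await;
  if done.is_none() {
    // On failure we return the error and leave the slot empty, so the next caller
    // that acquires the lock will retry the upload.
    try_upload(digest).await?;
    *done = Some(());
  }
  Ok(())
}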

@stuhood
Member

stuhood commented Jun 4, 2020

No repros in a few weeks! I've moved the optimization ideas to #9960, and am resolving this as fixed.

@stuhood closed this as completed Jun 4, 2020