Silent crash or failure when remoting. #9395

Closed
jsirois opened this issue Mar 25, 2020 · 14 comments · Fixed by #9793

@jsirois
Contributor

jsirois commented Mar 25, 2020

The subject and labels may be misleading; there is not much to go on here. For posterity, in case this becomes frequent, details follow.

Seen on a master CI burn in the 'Integration tests - V2 (Python 3.6)' shard:

$ travis-wait-enhanced --timeout 65m --interval 9m -- ./build-support/bin/ci.py --integration-tests-v2 --remote-execution-enabled --python-version 3.6
/home/travis/.pex/installed_wheels/b310034b9aeb62e68d8f271e153def66197e9d0d/toml-0.10.0-py2.py3-none-any.whl/toml/decoder.py:47: DeprecationWarning: invalid escape sequence \.
  TIME_RE = re.compile("([0-9]{2}):([0-9]{2}):([0-9]{2})(\.([0-9]{3,6}))?")
17:53:30 [WARN] Unmatched glob from tests/python/pants_test/backend/jvm/tasks:ivy_resolve_integration's `sources` field: "tests/python/pants_test/backend/jvm/tasks/test_ivy_resolve_integration.py"
17:53:38 [INFO] Starting tests: contrib/buildrefactor/tests/python/pants_test/contrib/buildrefactor:buildozer_integration
...
17:53:38 [INFO] Starting tests: tests/python/pants_test/goal:run_tracker_integration
[=== 00:00 CI BEGINS ===]
[=== 00:00 Setting interpreter constraints to ['CPython==3.6.*'] ===]
IntegrationTestsV2Remote
[=== 00:36 Running integration tests via V2 remote strategy. ===]
Integration test failure (V2 remote)
travis-wait-enhanced FTL Non-zero exit code for ./build-support/bin/ci.py error="exit status 1"
The command "travis-wait-enhanced --timeout 65m --interval 9m -- ./build-support/bin/ci.py --integration-tests-v2 --remote-execution-enabled --python-version 3.6" exited with 1.

The combination of `travis-wait-enhanced FTL Non-zero exit code for ./build-support/bin/ci.py` and no output from the test command after having scheduled all tests is the interesting bit. It looks like a Pants crash with no details at all.

@jsirois
Contributor Author

jsirois commented Mar 25, 2020

Seen again on #9398.

@Eric-Arellano
Contributor

I've often seen Pants report that ~20 tests have succeeded, and then it silently crashes. I believe I have only seen this in the remote execution shards.

@Eric-Arellano
Contributor

I got output for the first time!

06:23:09 [INFO] Tests succeeded: tests/python/pants_test/backend/jvm/tasks/reports
06:23:11 [INFO] Tests succeeded: contrib/cpp/tests/python/pants_test/contrib/cpp:cpp_toolchain
E0327 06:23:32.125356601 6544 chttp2_transport.c:754] client stream 1 still included in list 1

@stuhood self-assigned this Apr 4, 2020
stuhood added a commit that referenced this issue Apr 5, 2020
### Problem

#9395 occurs within our `grpcio` dependency, which is quite stale. Although #9395 is more likely to be related to changes in our executor (#9071) or to our transitive Rust dependencies (#9122), getting onto a more recent version of the `grpcio` crate _might_ resolve the issue, or at least make it easier to report an issue upstream.

### Solution

Bump to `0.5.1` with one patch (tikv/grpc-rs#457) pulled forward from our previous fork. 

[ci skip-jvm-tests]  # No JVM changes made.
@stuhood
Member

stuhood commented Apr 11, 2020

I believe that I saw this again.

EDIT: And again, but this time with:

memory allocation of 190881989 bytes failed

EDIT: rust-lang/rust#51245 allows for causing this to panic rather than just aborting, which would get us a stack trace for whatever is attempting to allocate that much. It seems to be a strangely consistent size though (~190 MB). Possibly a particular input/output we have that is ~190 MB...?

@Eric-Arellano
Contributor

I encountered this today with no extra information.

@stuhood
Member

stuhood commented Apr 17, 2020

Some whack-a-mole-style debug output turned up some culprits (https://api.travis-ci.com/v3/job/320301504/log.txt). It seems we're producing some very large outputs. Will look this evening.

03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:42 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)
03:10:43 [INFO] local load_bytes_with for File: Digest(Fingerprint<ae6e42ec21a27cb68f9b3dac51513f418a526587eedb09a8d0305838cd3b21f4>, 191007653)
03:10:43 [INFO] local load_bytes_with for File: Digest(Fingerprint<ce25a37f2574e3d21137135964528f404c6031b588a587ae211cfdf292a5e668>, 58541569)

@stuhood
Member

stuhood commented Apr 21, 2020

I waited too long to investigate, and those digests appear to be gone from the remote store. Will try again to repro.

@stuhood
Member

stuhood commented Apr 21, 2020

The 190 MB digest is for a `native_engine.so`. Not too surprising, in retrospect.

The next question is why we would ever be holding multiple copies of it at once (AFAIK, we shouldn't be). Will need to do some local debugging.

@stuhood
Member

stuhood commented May 12, 2020

I haven't been able to reproduce this locally under a profiler yet, but I've been brewing a hypothesis that we're racing to upload 32 copies of things in some cases when running integration tests. Will confirm that, but if so, there are a few tacks:

  1. Add a generalized async semaphore to throttle how much we hold in memory during these operations (see the sketch after this list)
  2. Avoid holding these in memory while we upload them (they're in LMDB: no need to actually hold them in userspace while uploading)
  3. Add some per-file memoization to de-dupe uploads
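
For reference, option 1 might look roughly like the following sketch (hypothetical names: `upload_one` stands in for loading one blob and writing it to the remote CAS, `Digest` is Pants' `hashing::Digest`, and this is not the actual Pants code):

use std::sync::Arc;
use hashing::Digest; // Pants' (Fingerprint, size) content digest
use tokio::sync::Semaphore;

// Hypothetical stand-in for loading one blob from LMDB and writing it to the remote CAS.
async fn upload_one(_digest: Digest) {}

// Allow at most `max_in_flight` uploads to hold their bytes in memory at once.
async fn upload_all(digests: Vec<Digest>, max_in_flight: usize) {
  let semaphore = Arc::new(Semaphore::new(max_in_flight));
  let uploads = digests.into_iter().map(|digest| {
    let semaphore = semaphore.clone();
    async move {
      // Wait for a slot before touching the blob's bytes.
      let _permit = semaphore.acquire().await.expect("semaphore closed");
      upload_one(digest).await;
      // `_permit` drops here, releasing the slot.
    }
  });
  futures::future::join_all(uploads).await;
}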

@Eric-Arellano
Contributor

As a lay observer, option 3 sounds the most compelling because it would presumably remove unnecessary work.

@Eric-Arellano
Contributor

Eric-Arellano commented May 13, 2020

Encountered this morning (https://api.travis-ci.com/v3/job/332126695/log.txt):

18:56:46:483 [INFO] Starting tests: tests/python/pants_test/backend/project_info/tasks:export_dep_as_jar_integration
memory allocation of 194079374 bytes failed

And then it fails.

@illicitonion
Contributor

> 3. Add some per-file memoization to de-dupe uploads

I imagine this taking the form of a `RemoteCasTracker`, which is basically:

use futures::future::{BoxFuture, Shared};
use hashing::Digest; // Pants' (Fingerprint, size) content digest
use lru::LruCache;

enum RemoteState {
  // We believe a previous upload of this digest completed.
  ProbablyUploaded,
  // An upload is in flight; callers can await a clone of this shared future.
  // (`Shared<BoxFuture<'static, ()>>` is one concrete way to spell a cloneable future.)
  Uploading(Shared<BoxFuture<'static, ()>>),
}

struct RemoteCasTracker {
  state: LruCache<Digest, RemoteState>,
}

Each time an upload is attempted, we'd check the `RemoteCasTracker`: if the digest is `ProbablyUploaded` we'd skip it; if it's `Uploading` we'd return a clone of the upload future; and if we get definite information that it isn't uploaded, we'd trigger an upload and set it to the `Uploading` state.
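
A minimal sketch of that lookup, assuming the types above plus a hypothetical `start_upload` helper standing in for the real CAS write:

use futures::FutureExt;

// Hypothetical stand-in for writing one blob to the remote CAS.
async fn start_upload(_digest: Digest) {}

impl RemoteCasTracker {
  // Returns a future that resolves once the digest is (probably) in the remote CAS.
  fn ensure_uploaded(&mut self, digest: Digest) -> Shared<BoxFuture<'static, ()>> {
    // Fast path: we already know something about this digest.
    if let Some(state) = self.state.get(&digest) {
      return match state {
        RemoteState::ProbablyUploaded => futures::future::ready(()).boxed().shared(),
        RemoteState::Uploading(f) => f.clone(),
      };
    }
    // Otherwise, start an upload and remember the in-flight future so that concurrent
    // callers share it instead of re-uploading the same bytes.
    let f = start_upload(digest).boxed().shared();
    self.state.put(digest, RemoteState::Uploading(f.clone()));
    f
  }
}

A real implementation would also want to flip the entry to ProbablyUploaded (or evict it) when the shared future completes, and to propagate upload errors rather than silently treating a failed upload as done.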

stuhood added a commit that referenced this issue May 16, 2020
### Problem

The LMDB API allows for direct references into its storage, which will always be MMAP'ed. But because it has a blocking API, we interact with it on threads that have been spawned for that purpose.

To allow for interacting with the database in a way that avoids copying data into userspace, the `load_bytes_with` method takes a function that will be called on the blocking thread with a reference directly to that memory. This allows for minimizing copies of data, and avoiding holding full copies of database entries in memory.

But at some point a while back (before async/await made dealing with references easy again), `load_bytes_with` started passing a `Bytes` instance to its callback. And constructing a `Bytes` instance with anything other than a static memory reference (which this isn't) requires copying into the `Bytes` instance to give it ownership. This meant that we weren't actually taking advantage of the odd shape of `load_bytes_with`.

### Solution

Switch the local `load_bytes_with` back to providing a reference into the database's owned memory, and document the reason for its shape. In separate commits, port the `serverset` crate and code that touches it to async/await. Finally, adjust the remote `store_bytes` to accept a reference to avoid potential copies there as well.

### Result

Less memory usage, and less copying of data. In particular: cases that read from the local database in order to copy elsewhere (such as `materialize_directory`, `ensure_remote_has_recursive`, and BRFS `read`) will now copy data directly out of the database and into the destination.

`ensure_remote_has_recursive` should now hold onto only [one chunk's worth](https://github.com/pantsbuild/pants/blob/4f45871814c15c0f41b335fc80f98780b9b38d92/src/python/pants/option/global_options.py#L752-L758) of data at a time per file it is uploading, which might be sufficient to fix #9395 (although the other changes mentioned there will likely still be useful).
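
To make the two callback shapes concrete, here is a minimal illustrative sketch (not the exact Pants signatures; the real `load_bytes_with` also handles the LMDB read transaction and the spawn onto a dedicated blocking thread):

use bytes::Bytes;

// The shape the code had drifted to: the callback gets an owned `Bytes`, and building
// that `Bytes` from a non-static slice forces a full copy of the value into userspace.
fn load_with_owned<T>(mmap_value: &[u8], f: impl FnOnce(Bytes) -> T) -> T {
  f(Bytes::copy_from_slice(mmap_value))
}

// The shape this change restores: the callback gets a borrowed slice pointing directly
// into the MMAP'ed value, so only the callback decides what (if anything) gets copied,
// e.g. one upload chunk at a time rather than a whole ~190 MB entry.
fn load_with_borrowed<T>(mmap_value: &[u8], f: impl FnOnce(&[u8]) -> T) -> T {
  f(mmap_value)
}
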
@stuhood reopened this May 16, 2020
@stuhood
Member

stuhood commented May 20, 2020

No repros of this since #9793. We'll likely still want to do one or both of the other things from #9395 (comment) at some point, but will wait a bit longer before closing this and opening a followup.

> I imagine this taking the form of a `RemoteCasTracker`, which is basically:

Might also be able to get away with an async Mutex around the result value, which might make it easier to recover from failure.
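
A rough sketch of that alternative, with hypothetical names (`try_upload` stands in for the real CAS write, `Digest` as above). The useful property is that a failed upload leaves the slot empty, so the next caller that takes the lock simply retries, whereas a cached shared future would keep handing out the old result unless it were explicitly evicted:

use std::sync::Arc;
use tokio::sync::Mutex;

// One slot per digest: `Some(())` once an upload has succeeded, `None` otherwise.
type UploadSlot = Arc<Mutex<Option<()>>>;

// Hypothetical stand-in for the real CAS write.
async fn try_upload(_digest: Digest) -> Result<(), String> {
  Ok(())
}

async fn ensure_uploaded(slot: UploadSlot, digest: Digest) -> Result<(), String> {
  // Only one caller uploads at a time; the rest queue on the async lock.
  let mut done = slot.lock().await;
  if done.is_none() {
    // On failure we return the error and leave the slot empty, so the next caller
    // that acquires the lock will retry the upload.
    try_upload(digest).await?;
    *done = Some(());
  }
  Ok(())
}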

@stuhood
Member

stuhood commented Jun 4, 2020

No repros in a few weeks! I've moved the optimization ideas to #9960, and am resolving this as fixed.

@stuhood closed this as completed Jun 4, 2020