CRASH when processing large offline trace #6854

Open
dbiton opened this issue Jun 23, 2024 · 2 comments

dbiton commented Jun 23, 2024

On a powerful server running Linux 5.15, I am running SPEC CPU2017's:
CPU/531.deepsjeng_r/exe/deepsjeng_r_base.mytest-m64
with the test dataset:
CPU/531.deepsjeng_r/data/test/input/test.txt
Without drcachesim, execution finishes in 5 seconds; with drcachesim, it takes 2 minutes.
I generate an offline trace with no special parameters. The raw folder contains a single 12GB .lz4 file; decompressing it manually yields 57GB. I use the drcachesim Linux binaries, version 10.0.0, downloaded from the site.
When running the cache simulator on the offline trace, after 10 minutes I get the following error:
Failed to open drmemtrace.deepsjeng_r_base.mytest-m64.190612.5977.dir/trace/drmemtrace.deepsjeng_r_base.mytest-m64.190612.1075.trace.zip
Failed to initialize scheduler: Failed to open drmemtrace.deepsjeng_r_base.mytest-m64.190612.5977.dir/trace/drmemtrace.deepsjeng_r_base.mytest-m64.190612.1075.trace.zip
ERROR: failed to initialize analyzer: raw2trace failed: Failed to process file for thread 190612: Failed to close prior component

Trying to parse the trace again immediately yields the error:
Failed to open drmemtrace.deepsjeng_r_base.mytest-m64.190612.5977.dir/trace/drmemtrace.deepsjeng_r_base.mytest-m64.190612.1075.trace.zip
Failed to initialize scheduler: Failed to open drmemtrace.deepsjeng_r_base.mytest-m64.190612.5977.dir/trace/drmemtrace.deepsjeng_r_base.mytest-m64.190612.1075.trace.zip
ERROR: failed to initialize analyzer
Looking in the trace folder that was created, there is a 2.8GB .zip file. Running unzip -l on it yields:
Archive: drmemtrace.deepsjeng_r_base.mytest-m64.200403.7015.trace.zip
End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of drmemtrace.deepsjeng_r_base.mytest-m64.200403.7015.trace or drmemtrace.deepsjeng_r_base.mytest-m64.200403.7015.trace.zip, and cannot find drmemtrace.deepsjeng_r_base.mytest-m64.200403.7015.trace.ZIP, period.
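
The unzip message above says the archive has no end-of-central-directory record, which is what one would expect if the writer failed before finalizing the file. As a rough way to confirm that independently of unzip, here is a minimal Python sketch that looks for the record's signature in the tail of a file. The function name and the idea of checking only the tail are illustrative assumptions, not part of drcachesim.

```python
import os

# Signature that starts the end-of-central-directory (EOCD) record in a zip file.
EOCD_SIGNATURE = b"PK\x05\x06"
# The EOCD record is 22 bytes long, optionally followed by a comment of
# at most 65535 bytes, so it must start within this many bytes of the end.
MAX_EOCD_SEARCH = 22 + 65535


def has_end_of_central_directory(path):
    """Return True if the EOCD signature appears in the tail of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Only the tail is read, so this stays cheap for multi-GB files.
        f.seek(max(0, size - MAX_EOCD_SEARCH))
        tail = f.read()
    return EOCD_SIGNATURE in tail
```

Calling has_end_of_central_directory() on the 2.8GB .zip should return False if the file was truncated as the unzip output suggests. Python's standard zipfile.is_zipfile() performs a similar check and could be used instead.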

I would appreciate pointers; I assume this is some sort of bug related to large traces. I can parse the traces from Google's charlie folder, the largest of which is about 2GB compressed, if I remember correctly. I can also compile my own C++ programs, instrument them, and parse the resulting traces.

Author

dbiton commented Jun 23, 2024

Using -debug does not change the output or fix the problem.

@derekbruening
Contributor

First:

If not, then: "Failed to close prior component" implies an archive file issue, so:

  • Does -trace_compress lz4 work for post-processing from raw to final? It won't support fast instruction skipping like zip though. That would confirm there is no other problem outside of archive output.
  • Does the latest version of zlib fix the problem? You can see we have a submodule using https://github.com/madler/zlib.git at a fixed revision: updating that to the latest and building from sources would be a good test as maybe this is a known fixed bug in the zlib minizip library.

If the latest zlib doesn't fix it, it would need further debugging:

  • Please either edit the code at https://github.com/DynamoRIO/dynamorio/blob/master/clients/drcachesim/common/zipfile_ostream.h#L111 to print out the zip library error code and build DR from sources, or use a debugger to break on the failure in the zip library and get the error code.
  • Getting the failure point in the debugger would also be useful to see the raw2trace.cpp state at that point: how many 10-million-instruction chunks into the trace are we?
  • Does changing the chunk size (archive sub-component size) from the default 10M to, say, 1K (-chunk_instr_count 1K) change the behavior?
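
Short of attaching a debugger, a rough estimate of how far the archive writer got can be obtained from the truncated .zip itself. The sketch below counts zip local-file-header signatures in a file, on the assumption (taken from the description of chunks as archive sub-components) that each component written corresponds to one chunk. The function name is hypothetical, and because the signature bytes can also occur by chance inside compressed data, the result should be treated as an approximate upper bound.

```python
# Signature that starts every local file header (one per archive component).
LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"


def count_local_headers(path, block_size=1 << 20):
    """Count local-file-header signatures by streaming the file in blocks."""
    count = 0
    carry = b""  # last few bytes of the previous block
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            data = carry + block
            count += data.count(LOCAL_HEADER_SIGNATURE)
            # Keep the last 3 bytes so that a signature split across two
            # blocks is still seen; 3 bytes can never hold a full match,
            # so nothing is counted twice.
            carry = data[-3:]
    return count
```

Comparing the count obtained with the default chunk size against the count obtained after rerunning with a smaller -chunk_instr_count might show whether the failure follows a component count or a file-size threshold.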
