APEX: messed up JSON with APEX_TRACE_EVENT=1 for Google Trace Events #5239
@albestro hey there ... I would be surprised if there are race conditions writing to the stream, but it is certainly possible. A couple of questions:
I think the issue is exactly that the ranks don't know what id they are (networking is off in HPX). I know @biddisco worked on that at some point, and possibly even made it work, but given that it's not tested it's probably broken. @biddisco do you remember where the rank was supposed to be set if networking is off? @khuck in the meantime, if I remember correctly it's possible to set the output filename for OTF2 files with an environment variable. Is it possible to do the same for the Google Trace Event files?
hey @khuck! thanks for your quick reply!
Correct, using the defaults selected. From the HPX CMake configure process I see
```shell
APEX_OTF2=1 srun -n 2 -c36 /users/ialberto/workspace/dla-future/builds/daint/miniapp/miniapp_reduction_to_band --matrix-size 10240 --block-size 512 --grid-rows 2 --grid-cols 1 --nwarmups=0 --nruns 1 --hpx:use-process-mask
[OTF2] src/otf2_archive_int.c:3945: error: File does already exist: Could not create archive trace directory!
[OTF2] src/otf2_archive_int.c:1108: error: File does already exist: Couldn't create directories on root.
OTF2 Error: 17, File does already exist
[0]
[0] 26.1765s nanGFlop/s (10240, 10240) (512, 512) (2, 1) 18
Closing OTF2 event files...
Closing OTF2 event files...
Writing OTF2 definition files...
Writing OTF2 Global definition file...
Writing OTF2 Node information...
Writing OTF2 Communicators...
Closing the archive...
done.
```
Here it hangs... After killing it, the content of the folder is:

```
OTF2_archive/
|-- APEX
|   |-- 0.def
|   |-- 0.evt
|   |-- 1.def
|   |-- 1.evt
|   |-- 10.def
|   |-- 10.evt
|   |-- 11.def
|   |-- 11.evt
|   |-- 12.def
|   |-- 12.evt
|   |-- 13.def
|   |-- 13.evt
|   |-- 14.def
|   |-- 14.evt
|   |-- 15.def
|   |-- 15.evt
|   |-- 16.def
|   |-- 16.evt
|   |-- 17.def
|   |-- 17.evt
|   |-- 18.def
|   |-- 18.evt
|   |-- 2.def
|   |-- 2.evt
|   |-- 3.def
|   |-- 3.evt
|   |-- 4.def
|   |-- 4.evt
|   |-- 5.def
|   |-- 5.evt
|   |-- 6.def
|   |-- 6.evt
|   |-- 7.def
|   |-- 7.evt
|   |-- 8.def
|   |-- 8.evt
|   |-- 9.def
|   `-- 9.evt
|-- APEX.def
`-- APEX.otf2
```

But if I open it with Vampir, I get an empty trace.
NOTE: A thing that may be relevant and I forgot to specify: I'm running my application with 2 ranks on the same node, and the application itself uses MPI directly.

```shell
APEX_TRACE_EVENT=1 srun -n 2 -c36 /users/ialberto/workspace/dla-future/builds/daint/miniapp/miniapp_reduction_to_band --matrix-size 10240 --block-size 512 --grid-rows 2 --grid-cols 1 --nwarmups=0 --nruns 1 --hpx:use-process-mask
```

Here it is trace.tar.gz. If you try to load it with the Google Format tool, it reports the byte offset where it fails. Just for reference, a couple of errors:

- line 95424: the entry is truncated (after the `"ts"` field) and a new one starts:

```json
{"name":"hpx_main","ph":"X","pid":173700657,"tid":19,"ts":1615799668667355.000{"name":"hpx_main","ph":"X","pid":0,"tid":19,"ts":1615799668616451.000000, "dur": 83.561636,"args":{"GUID":14411518807585721556,"Parent GUID":14411518807585721555}},
```

- line 252987: there is the termination of the array (end of the first rank's data?) but then data continues, starting truncated:

```json
{"name":"APEX MAIN", "ph":"E","pid":0,"tid":0,"ts":1615799684527165.500000}
]
}
x_main","ph":"X","pid":0,"tid":13,"ts":1615799683699657.750000, "dur": 32.267636,"args":{"GUID":2305843009213702076,"Parent GUID":6917529027641089785}},
```
Based on what you've said, it sounds like the ranks are writing to the same file. Which MPI are you using? Which batch system? Without HPX networking, APEX can sometimes guess the ranks based on environment variables. If this is a Cray system, I might have to add support for their variable.
Yes, it's Cray MPICH with Slurm.
Slurm should have been caught, but I'll add support for Cray...
I had a quick look at the snippet of code you mentioned, and did a quick test:

```shell
$ srun -n 2 -o "%t-env.txt" printenv
$ grep "SLURM_PROCID" *-env.txt
0-env.txt:SLURM_PROCID=0
1-env.txt:SLURM_PROCID=1
```

I got the same results with 2 ranks, whether on a single node or on separate nodes. So I would confirm your expectation: if that snippet of code gets called, it is able to extract the rank from the SLURM environment variables.
Hey, just to be sure...

```cpp
uint64_t test_for_MPI_comm_rank(uint64_t commrank) { // WARNING: here commrank is not passed by reference
    // ...
    // Slurm - last resort
    tmpvar = getenv("SLURM_PROCID");
    if (tmpvar != NULL) {
        commrank = atol(tmpvar);
        return commrank;
    }
    return commrank;
}
```

EDIT: Just realized that it seems you are using it as a fallback value.
I just copied the function and ran it:

```shell
$ srun -n 2 a.out
SLURM
1
SLURM
0
```

So, I think we should look in other directions. Do you have any suggestions on how to debug this?
You could try printing out the trace file name when the file is created. If you are getting two unique file names, then there is something else going on.
I've just checked, and I get exactly the same filename on all (2) ranks (tested on the same node for ease). While I wait for your next suggestion on how to keep investigating, I take the chance to ask you one question (I have more about this topic; it would be nice to quickly discuss a couple of things with you. Let me know if you are open to this and how we can organize it). So, once this is fixed, when using the Google Trace tool we are supposed to open each rank's trace in a separate window, right?
@albestro It would appear there are two issues here. First, there is a bug in the OTF2 library, which I thought we had reported and had been fixed. In OTF2, you need to patch the
The second issue is that when HPX is configured without parcel port support, HPX can't tell APEX how many ranks there are or what the rank ID of each process is. That's OK, as long as the MPI implementation sets some environment variables. I have added additional support for Cray systems, which might take care of the issue with both tracing outputs (OTF2 and Google Trace Events). Please configure HPX with
I did as you suggested, but without luck.

Google Trace Events still creates a single file with the usual "output overlaps" between ranks. OTF2 still hangs on exit.

If I kill the process, I get
@albestro this looks more like a shutdown race condition. Did you at least get a valid OTF2 trace? I'll see if I can reproduce the race condition. Is there any way you can share the application that is causing this deadlock?
As I encounter a very similar issue with APEX/OTF2, here is a small example program that allows me to reliably reproduce it. The OTF2 output deadlocks even earlier here; all I get is:

```
Closing OTF2 event files...
Writing OTF2 definition files...
```

The path
I don't know if that's exactly the case, because as you can see in a previous message, I get some output files.
The output folder is populated with files, but when I open it with Vampir I don't get anything (as in the previous #5239 (comment))
It's a miniapp of my library. I don't have any problem sharing it with you, but it may require some work on your side to prepare the environment (it is Spack based, so it won't require too much). Before that, I can try to see if I can reproduce it with a simpler example... I'll let you know. In the meantime, is it worth separating the two issues, "OTF2 hanging" and "Google Trace Events single-file race condition"?
As pointed out in STEllAR-GROUP/hpx#5239, there is an issue in OTF2 <=2.2 where a variable is not properly initialized. As currently no release of OTF2 fixing this is available, the patch should be applied.
Hey @khuck, yesterday I found the time to give this a look. I keep thinking that the root of the two problems, i.e. "OTF2 hang" and "Google Trace overlap", might be the same. It looks like a race condition in both cases: one ends up with a hang, the other with concurrent writes to the same file. But that's just my speculation, without much knowledge about this stuff.

Yesterday I worked on the OTF2 problem. Curiously enough, after a few tries, I was able to reproduce the problem with a super simple executable:

```cpp
#include <hpx/hpx.hpp>
#include <hpx/hpx_init.hpp>
#include <mpi.h>

int hpx_main() {
    hpx::make_ready_future<void>().then([](auto&&) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        hpx::util::annotate_function _(std::to_string(rank));
    });
    return hpx::finalize();
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    auto ret_code = hpx::init(argc, argv);
    MPI_Finalize();
    return ret_code;
}
```

I'm not really sure everything is needed to reproduce it, but it is simple enough, and at the same time it looks very similar to my classic run (just as a reminder

The game changer in reproducing the OTF2 hang was the number of ranks. In fact, the probability of getting a hang is directly proportional to the number of ranks I use (e.g. generally 8 is a good number, but with more it was way more frequent, i.e. almost every run was hanging). An important point: I worked in a multi-rank-per-node configuration, and I did not have the chance to test with multiple nodes yet (I can try). I think this may be quite useful for you to narrow down the problem, also considering what you said in your previous comments. I don't know how OTF2 is expected to work (one folder per rank? one folder per node?), but it may be insightful for you to see the following stacktrace. To collect it, I ran a "hanging configuration" with multiple ranks on a single node, and among these ranks I saw that a number of them were waiting on
At the very end of the above gdb session, you can see I selected the
This may be just the effect and not the root of the problem, but with your knowledge and experience of these libraries you may be able to see the bigger picture. As always, I'm open to any other test/check you want me to do.

Note: I'm using HPX 1.6.0, APEX develop, OTF2 patched.
@albestro Ah, this makes perfect sense! The "file based" method is the "least good" choice among HPX, MPI and shared filesystems for doing event unification, which is required by OTF2. My guess is that the file-based solution isn't working, probably because the application doesn't know how many total ranks there are! Ideally, if you configure APEX with HPX, it uses HPX to do the event unification at the end. However, you aren't using HPX parcel ports, so that is not an option. Here are some possible solutions, from easiest to hardest:
Solution number 1 will be a faster fix than number 2...
I reviewed the Slurm code for detecting environment variables, and I think I found the (or at least 'a') problem. I was using the wrong variable (documentation can be misleading...). If you pull the latest APEX develop branch (you'll get some Kokkos changes with it, but they shouldn't affect you), I hope it'll fix the problem.
Aha! Good catch! Tested my basic example with UO-OACISS/apex@e029c42 and I can see the difference with both OTF2 and GoogleTraceFormat!
I'll give it a try with my miniapp as soon as possible, but I think this issue can be closed. @khuck I take the chance to point you to discussion #5263, which I'd like to have with you about a "guideline" on how best to use APEX. Thanks for your support, and in advance for any comments on the discussion!
Due to other problems I have with OTF2 (it hangs on exit; I may create a different issue for that), I'm giving the Google Trace Events format a try with `APEX_TRACE_EVENT=1`. I was able to run my code; it exits cleanly and produces output. Unfortunately, the JSON produced is messed up, as if there were a race condition between multiple ranks writing to the same JSON file.
Eventually I was able to open the JSON after manually fixing the messed-up parts of the file. In my runs I could quickly figure out what to remove (clearly losing a few entries), thanks also to the Google Tracing Tool pointing me to the buffer position where it found a grammar problem.
I didn't try to create a minimal reproducible example, since it may not be easy or deterministic; anyway, I can help test with my code, which so far has always generated a malformed JSON.

The HPX team mentioned you, @khuck, as the APEX expert; let me know if I can help you in some way or provide more information via other specific tests.

(OT: it might be interesting to have a chat/call with you about the general topic of "annotation". I'm open to that, let me know!)