-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/shm, prov/efa, fabtests/hmem, contrib/intel: enable full fabtests FI_HMEM support and enable in Intel CI #9404
Conversation
I know it's touching a lot of different areas of the code. I can split it up into different PRs if you like but want to make sure all of them together pass everything |
contrib/intel/jenkins/Jenkinsfile
Outdated
if (util) | ||
opts = "${opts} --util=${util}" | ||
|
||
if (user_env) | ||
opts = "${opts} --user_env ${user_env}" | ||
|
||
for (mode in BUILD_MODES) { | ||
if (way) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs curly brackets around it. Right now youre forcing fabtests to only do reg since the modes = ["reg"] line is not protected in this if.
bot:aws:retest |
It fails AWS CI in rdm_atomic test on single node with EFA provider
|
It also failed the
|
ZE v2 hardware no longer a priority for testing, replaced by v3 Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Adding FI_HMEM support to atomic calls requires the correct shm descriptor for the result and compare iovs. Efa was passing in the efa descriptors and not the shm ones. Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
@shijin-aws @a-szegel Updated PR with fix in efa for passing in the correct shm descriptor into the atomic calls. Let me know if you see any issues with it! |
@a-szegel @shijin-aws The AWS CI was still failing with a similar MR descriptor issue so I added another fix and now it's passing but it's weird that everything else was passing without that and only the atomic path seems to be problematic... Let me know what you think or if you want to rerun/do more testing to investigate the impact before merging. |
@aingerson Your fix to efa provider looks reasonable to me, thanks.
Are you talking about this fix 61140ec? I think that function assumes shm_desc is only updated when efa_mr is present. We already set shm_desc as NULL before calling this function in |
@shijin-aws Thank you for the explanation! I missed the NULL initialization in the msg and RMA paths but that makes sense now. Would you like me to remove the other initialization so that you don't have to initialize twice? Or just leave it? |
I think removing other initialization is better, thank you |
@shijin-aws Updated to remove the NULL initialization in the other places. It was missing in the readmsg and writemsg paths so could be good to backport the change to make sure. I'll leave that up to you though. |
I will backport your fix commits for efa to v1.19.x, thanks |
@shijin-aws @a-szegel More failures happened when I removed the NULL initialization, and it was because there were a lot of checks that initialized the shm_desc conditionally so I had to remove them all. I also did a little of cleanup with some initialization that seemed unnecessary to cleanup the shm send path. Let me know if I overstepped and you want me to undo those. |
@aingerson how about we just initialize |
shm now has atomic support for FI_HMEM can can be enabled Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Add support for FI_HMEM and atomics by passing the desc/ofi_mr to copy atomic data to and from the shm region. On the target side, use a temporary host buffer for the atomic operation and then copy the result into the destination using the hmem functions Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Use the dev_host_buf for filling and checking the sent/received greeting in order to use device memory for the actual transfer. This also moves the received data print into the check greeting function and removes it from all places it is called from. Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Use dev_host_buf to copy data into and out of transfer buffers to support FI_HMEM Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
The uint64_t device was updated to include the device and driver indices. The index is isolated using the ze_get_device_idx function which masks the driver bits. The cmd_queue used in the copy function was using the full device rather than just the dev_id Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Since not all of the fabtests supported device memory, the CI used a wrapper script to run a subset of tests that were supported. Now that all of the runfabtests support FI_HMEM properly, this simplifies the ZE testing by going through the regular run_fabtests path. The --device parameter is removed and replaced with the --way parameter which indicates which direction to test (h2d, d2d, xd2d, default None). This simplifies the code path and enables the ZE testing to run the full testsuite, rather than the subset, increasing our coverage. Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
The AWS failure is not related to this change (TCP). |
bot:aws:retest |
Full patch set enables shm+FI_HMEM to be run with complete fabtests set (by fixing missing support and adding missing shm functionality) and enables full runs on the Intel CI