prov/shm, prov/efa, fabtests/hmem, contrib/intel: enable full fabtests FI_HMEM support and enable in Intel CI #9404

Merged: 8 commits, Oct 16, 2023

aingerson
Contributor

Full patch set enables shm+FI_HMEM to be run with complete fabtests set (by fixing missing support and adding missing shm functionality) and enables full runs on the Intel CI

  • Fix missing fabtests support for FI_HMEM (fi_rdm, fi_rdm_event)
  • Add support for use of FI_ATOMIC with FI_HMEM
  • Remove ZE-specific stage class since we can run directly through fabtests script
  • Fix ZE bug using incorrect device id
  • Remove ZE v2 testing (fully replace with v3)

@aingerson
Contributor Author

I know it's touching a lot of different areas of the code. I can split it up into different PRs if you like but want to make sure all of them together pass everything

if (util)
  opts = "${opts} --util=${util}"

if (user_env)
  opts = "${opts} --user_env ${user_env}"

for (mode in BUILD_MODES) {
  if (way)

This needs curly braces around it. Right now you're forcing fabtests to only run reg, since the modes = ["reg"] line is not guarded by this if.
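The pitfall in this review comment applies to any C-family language: an `if` without braces guards only the single statement that follows it. The sketch below shows this in C rather than in the CI script's own language. The struct, the function names, and the mode count of 3 are all hypothetical and are not taken from the CI code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test plan: whether device memory is used and how many modes run. */
struct run_plan {
    int use_device;
    int nmodes;
};

/* Without braces, only the first statement belongs to the if. */
static struct run_plan plan_unbraced(const char *way)
{
    struct run_plan p = { 0, 3 };

    if (way)
        p.use_device = 1;
    p.nmodes = 1;   /* not part of the if: runs for every caller */
    return p;
}

/* With braces, both statements are guarded by the condition. */
static struct run_plan plan_braced(const char *way)
{
    struct run_plan p = { 0, 3 };

    if (way) {
        p.use_device = 1;
        p.nmodes = 1;
    }
    return p;
}

/* Usage: the unbraced version restricts the modes even when no way is given. */
static void demo(void)
{
    assert(plan_unbraced(NULL).nmodes == 1);
    assert(plan_braced(NULL).nmodes == 3);
}
```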

@aingerson
Contributor Author

bot:aws:retest

@shijin-aws
Contributor

shijin-aws commented Oct 6, 2023

It fails AWS CI in the rdm_atomic test on a single node with the EFA provider.

--------------------------------- Captured Log ---------------------------------

--------------------------------- Captured Out ---------------------------------

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.84 'timeout 1800 /bin/bash --login -c '"'"'FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr9404-debug/install/fabtests/bin/fi_rdm_atomic -p efa -E=9230'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.84 'timeout 1800 /bin/bash --login -c '"'"'FI_EFA_USE_DEVICE_RDMA=0 /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr9404-debug/install/fabtests/bin/fi_rdm_atomic -p efa -E=9230 172.31.45.84'"'"''
client_stdout:
Provider doesn't support FI_MIN base atomic operation on FI_FLOAT_COMPLEX
Provider doesn't support FI_MIN base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_MIN base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_MIN base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_MIN base atomic operation on FI_INT128
Provider doesn't support FI_MIN base atomic operation on FI_UINT128
Provider doesn't support FI_MAX base atomic operation on FI_FLOAT_COMPLEX
Provider doesn't support FI_MAX base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_MAX base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_MAX base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_MAX base atomic operation on FI_INT128
Provider doesn't support FI_MAX base atomic operation on FI_UINT128
Provider doesn't support FI_SUM base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_SUM base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_SUM base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_SUM base atomic operation on FI_INT128
Provider doesn't support FI_SUM base atomic operation on FI_UINT128
name                                              bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
FI_INT8_FI_MIN_base_lat                           1       1k      1000        0.00s      0.57       1.74       0.57
FI_UINT8_FI_MIN_base_lat                          1       1k      1000        0.00s      0.57       1.74       0.57
FI_INT16_FI_MIN_base_lat                          2       1k      1.9k        0.00s      1.18       1.70       0.59
FI_UINT16_FI_MIN_base_lat                         2       1k      1.9k        0.00s      1.18       1.70       0.59
FI_INT32_FI_MIN_base_lat                          4       1k      3.9k        0.00s      2.35       1.70       0.59
FI_UINT32_FI_MIN_base_lat                         4       1k      3.9k        0.00s      2.39       1.67       0.60
FI_INT64_FI_MIN_base_lat                          8       1k      7.8k        0.00s      4.69       1.71       0.59
FI_UINT64_FI_MIN_base_lat                         8       1k      7.8k        0.00s      4.66       1.72       0.58
FI_FLOAT_FI_MIN_base_lat                          4       1k      3.9k        0.00s      2.37       1.69       0.59
FI_DOUBLE_FI_MIN_base_lat                         8       1k      7.8k        0.00s      4.68       1.71       0.58
FI_INT8_FI_MAX_base_lat                           1       1k      1000        0.00s      0.59       1.71       0.59
FI_UINT8_FI_MAX_base_lat                          1       1k      1000        0.00s      0.59       1.70       0.59
FI_INT16_FI_MAX_base_lat                          2       1k      1.9k        0.00s      1.19       1.68       0.60
FI_UINT16_FI_MAX_base_lat                         2       1k      1.9k        0.00s      1.18       1.69       0.59
FI_INT32_FI_MAX_base_lat                          4       1k      3.9k        0.00s      2.36       1.70       0.59
FI_UINT32_FI_MAX_base_lat                         4       1k      3.9k        0.00s      2.37       1.69       0.59
FI_INT64_FI_MAX_base_lat                          8       1k      7.8k        0.00s      4.74       1.69       0.59
FI_UINT64_FI_MAX_base_lat                         8       1k      7.8k        0.00s      4.73       1.69       0.59
FI_FLOAT_FI_MAX_base_lat                          4       1k      3.9k        0.00s      2.43       1.65       0.61
FI_DOUBLE_FI_MAX_base_lat                         8       1k      7.8k        0.00s      4.76       1.68       0.59
FI_INT8_FI_SUM_base_lat                           1       1k      1000        0.00s      0.59       1.70       0.59
FI_UINT8_FI_SUM_base_lat                          1       1k      1000        0.00s      0.59       1.69       0.59
FI_INT16_FI_SUM_base_lat                          2       1k      1.9k        0.00s      1.17       1.71       0.59
FI_UINT16_FI_SUM_base_lat                         2       1k      1.9k        0.00s      1.19       1.68       0.60
FI_INT32_FI_SUM_base_lat                          4       1k      3.9k        0.00s      2.38       1.68       0.59
FI_UINT32_FI_SUM_base_lat                         4       1k      3.9k        0.00s      2.37       1.69       0.59
FI_INT64_FI_SUM_base_lat                          8       1k      7.8k        0.00s      4.75       1.68       0.59
FI_UINT64_FI_SUM_base_lat                         8       1k      7.8k        0.00s      4.78       1.67       0.60
FI_FLOAT_FI_SUM_base_lat                          4       1k      3.9k        0.00s      2.37       1.69       0.59
FI_DOUBLE_FI_SUM_base_lat                         8       1k      7.8k        0.00s      4.73       1.69       0.59
FI_FLOAT_COMPLEX_FI_SUM_base_lat                  8       1k      7.8k        0.00s      4.76       1.68       0.60
FI_INT8_FI_PROD_base_lat                          1       1k      1000        0.00s      0.59       1.69       0.59
FI_UINT8_FI_PROD_base_lat                         1       1k      1000        0.00s      0.59       1.69       0.59
FI_INT16_FI_PROD_base_lat                         2       1k      1.9k        0.00s      1.19       1.69       0.59
FI_UINT16_FI_PROD_base_lat                        2       1k      1.9k        0.00s      1.20       1.67       0.60
Provider doesn't support FI_PROD base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_PROD base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_PROD base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_PROD base atomic operation on FI_INT128
Provider doesn't support FI_PROD base atomic operation on FI_UINT128
Provider doesn't support FI_LOR base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_LOR base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_LOR base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_LOR base atomic operation on FI_INT128
Provider doesn't support FI_LOR base atomic operation on FI_UINT128
Provider doesn't support FI_LAND base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_LAND base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_LAND base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_LAND base atomic operation on FI_INT128
Provider doesn't support FI_LAND base atomic operation on FI_UINT128
FI_INT32_FI_PROD_base_lat                         4       1k      3.9k        0.00s      2.37       1.69       0.59
FI_UINT32_FI_PROD_base_lat                        4       1k      3.9k        0.00s      2.37       1.69       0.59
FI_INT64_FI_PROD_base_lat                         8       1k      7.8k        0.00s      4.73       1.69       0.59
FI_UINT64_FI_PROD_base_lat                        8       1k      7.8k        0.00s      4.73       1.69       0.59
FI_FLOAT_FI_PROD_base_lat                         4       1k      3.9k        0.00s      2.39       1.67       0.60
FI_DOUBLE_FI_PROD_base_lat                        8       1k      7.8k        0.00s      4.74       1.69       0.59
FI_FLOAT_COMPLEX_FI_PROD_base_lat                 8       1k      7.8k        0.00s      4.66       1.72       0.58
FI_INT8_FI_LOR_base_lat                           1       1k      1000        0.00s      0.59       1.69       0.59
FI_UINT8_FI_LOR_base_lat                          1       1k      1000        0.00s      0.59       1.69       0.59
FI_INT16_FI_LOR_base_lat                          2       1k      1.9k        0.00s      1.23       1.63       0.62
FI_UINT16_FI_LOR_base_lat                         2       1k      1.9k        0.00s      1.26       1.59       0.63
FI_INT32_FI_LOR_base_lat                          4       1k      3.9k        0.00s      2.51       1.59       0.63
FI_UINT32_FI_LOR_base_lat                         4       1k      3.9k        0.00s      2.44       1.64       0.61
FI_INT64_FI_LOR_base_lat                          8       1k      7.8k        0.00s      4.91       1.63       0.61
FI_UINT64_FI_LOR_base_lat                         8       1k      7.8k        0.00s      5.13       1.56       0.64
FI_FLOAT_FI_LOR_base_lat                          4       1k      3.9k        0.00s      2.48       1.62       0.62
FI_DOUBLE_FI_LOR_base_lat                         8       1k      7.8k        0.00s      4.79       1.67       0.60
FI_FLOAT_COMPLEX_FI_LOR_base_lat                  8       1k      7.8k        0.00s      4.76       1.68       0.59
FI_INT8_FI_LAND_base_lat                          1       1k      1000        0.00s      0.62       1.62       0.62
FI_UINT8_FI_LAND_base_lat                         1       1k      1000        0.00s      0.63       1.59       0.63
FI_INT16_FI_LAND_base_lat                         2       1k      1.9k        0.00s      1.24       1.61       0.62
FI_UINT16_FI_LAND_base_lat                        2       1k      1.9k        0.00s      1.23       1.63       0.62
FI_INT32_FI_LAND_base_lat                         4       1k      3.9k        0.00s      2.45       1.63       0.61
FI_UINT32_FI_LAND_base_lat                        4       1k      3.9k        0.00s      2.40       1.67       0.60
FI_INT64_FI_LAND_base_lat                         8       1k      7.8k        0.00s      4.83       1.66       0.60
FI_UINT64_FI_LAND_base_lat                        8       1k      7.8k        0.00s      4.87       1.64       0.61
FI_FLOAT_FI_LAND_base_lat                         4       1k      3.9k        0.00s      2.52       1.59       0.63
FI_DOUBLE_FI_LAND_base_lat                        8       1k      7.8k        0.00s      4.86       1.65       0.61
FI_FLOAT_COMPLEX_FI_LAND_base_lat                 8       1k      7.8k        0.00s      4.76       1.68       0.60
FI_INT8_FI_BOR_base_lat                           1       1k      1000        0.00s      0.59       1.68       0.59
FI_UINT8_FI_BOR_base_lat                          1       1k      1000        0.00s      0.59       1.70       0.59
FI_INT16_FI_BOR_base_lat                          2       1k      1.9k        0.00s      1.18       1.69       0.59
FI_UINT16_FI_BOR_base_lat                         2       1k      1.9k        0.00s      1.20       1.67       0.60
FI_INT32_FI_BOR_base_lat                          4       1k      3.9k        0.00s      2.38       1.68       0.60
FI_UINT32_FI_BOR_base_lat                         4       1k      3.Provider doesn't support FI_BOR base atomic operation on FI_FLOAT
Provider doesn't support FI_BOR base atomic operation on FI_DOUBLE
Provider doesn't support FI_BOR base atomic operation on FI_FLOAT_COMPLEX
Provider doesn't support FI_BOR base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_BOR base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_BOR base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_BOR base atomic operation on FI_INT128
Provider doesn't support FI_BOR base atomic operation on FI_UINT128
Provider doesn't support FI_BAND base atomic operation on FI_FLOAT
Provider doesn't support FI_BAND base atomic operation on FI_DOUBLE
Provider doesn't support FI_BAND base atomic operation on FI_FLOAT_COMPLEX
Provider doesn't support FI_BAND base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_BAND base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_BAND base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_BAND base atomic operation on FI_INT128
Provider doesn't support FI_BAND base atomic operation on FI_UINT128
Provider doesn't support FI_LXOR base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_LXOR base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_LXOR base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_LXOR base atomic operation on FI_INT128
Provider doesn't support FI_LXOR base atomic operation on FI_UINT128
Provider doesn't support FI_BXOR base atomic operation on FI_FLOAT
Provider doesn't support FI_BXOR base atomic operation on FI_DOUBLE
Provider doesn't support FI_BXOR base atomic operation on FI_FLOAT_COMPLEX
Provider doesn't support FI_BXOR base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_BXOR base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_BXOR base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_BXOR base atomic operation on FI_INT128
Provider doesn't support FI_BXOR base atomic operation on FI_UINT128
timeout: the monitored command dumped core

client returncode: 255
server_stdout:
Provider doesn't support FI_MIN base atomic operation on FI_FLOAT_COMPLEX
Provider doesn't support FI_MIN base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_MIN base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_MIN base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_MIN base atomic operation on FI_INT128
Provider doesn't support FI_MIN base atomic operation on FI_UINT128
Provider doesn't support FI_MAX base atomic operation on FI_FLOAT_COMPLEX
Provider doesn't support FI_MAX base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_MAX base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_MAX base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_MAX base atomic operation on FI_INT128
Provider doesn't support FI_MAX base atomic operation on FI_UINT128
Provider doesn't support FI_SUM base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_SUM base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_SUM base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_SUM base atomic operation on FI_INT128
Provider doesn't support FI_SUM base atomic operation on FI_UINT128
name                                              bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
FI_INT8_FI_MIN_base_lat                           1       1k      1000        0.00s      0.58       1.73       0.58
FI_UINT8_FI_MIN_base_lat                          1       1k      1000        0.00s      0.60       1.65       0.60
FI_INT16_FI_MIN_base_lat                          2       1k      1.9k        0.00s      1.18       1.70       0.59
FI_UINT16_FI_MIN_base_lat                         2       1k      1.9k        0.00s      1.16       1.73       0.58
FI_INT32_FI_MIN_base_lat                          4       1k      3.9k        0.00s      2.32       1.72       0.58
FI_UINT32_FI_MIN_base_lat                         4       1k      3.9k        0.00s      2.33       1.72       0.58
FI_INT64_FI_MIN_base_lat                          8       1k      7.8k        0.00s      4.65       1.72       0.58
FI_UINT64_FI_MIN_base_lat                         8       1k      7.8k        0.00s      4.68       1.71       0.59
FI_FLOAT_FI_MIN_base_lat                          4       1k      3.9k        0.00s      2.32       1.72       0.58
FI_DOUBLE_FI_MIN_base_lat                         8       1k      7.8k        0.00s      4.73       1.69       0.59
FI_INT8_FI_MAX_base_lat                           1       1k      1000        0.00s      0.58       1.72       0.58
FI_UINT8_FI_MAX_base_lat                          1       1k      1000        0.00s      0.58       1.72       0.58
FI_INT16_FI_MAX_base_lat                          2       1k      1.9k        0.00s      1.16       1.72       0.58
FI_UINT16_FI_MAX_base_lat                         2       1k      1.9k        0.00s      1.15       1.74       0.58
FI_INT32_FI_MAX_base_lat                          4       1k      3.9k        0.00s      2.34       1.71       0.58
FI_UINT32_FI_MAX_base_lat                         4       1k      3.9k        0.00s      2.33       1.72       0.58
FI_INT64_FI_MAX_base_lat                          8       1k      7.8k        0.00s      4.67       1.71       0.58
FI_UINT64_FI_MAX_base_lat                         8       1k      7.8k        0.00s      4.65       1.72       0.58
FI_FLOAT_FI_MAX_base_lat                          4       1k      3.9k        0.00s      2.32       1.72       0.58
FI_DOUBLE_FI_MAX_base_lat                         8       1k      7.8k        0.00s      4.96       1.61       0.62
FI_INT8_FI_SUM_base_lat                           1       1k      1000        0.00s      0.59       1.70       0.59
FI_UINT8_FI_SUM_base_lat                          1       1k      1000        0.00s      0.59       1.70       0.59
FI_INT16_FI_SUM_base_lat                          2       1k      1.9k        0.00s      1.18       1.69       0.59
FI_UINT16_FI_SUM_base_lat                         2       1k      1.9k        0.00s      1.18       1.69       0.59
FI_INT32_FI_SUM_base_lat                          4       1k      3.9k        0.00s      2.32       1.72       0.58
FI_UINT32_FI_SUM_base_lat                         4       1k      3.9k        0.00s      2.35       1.70       0.59
FI_INT64_FI_SUM_base_lat                          8       1k      7.8k        0.00s      4.72       1.69       0.59
FI_UINT64_FI_SUM_base_lat                         8       1k      7.8k        0.00s      4.64       1.73       0.58
FI_FLOAT_FI_SUM_base_lat                          4       1k      3.9k        0.00s      2.32       1.72       0.58
FI_DOUBLE_FI_SUM_base_lat                         8       1k      7.8k        0.00s      4.60       1.74       0.57
FI_FLOAT_COMPLEX_FI_SUM_base_lat                  8       1k      7.8k        0.00s      4.80       1.67       0.60
FI_INT8_FI_PROD_base_lat                          1       1k      1000        0.00s      0.58       1.73       0.58
FI_UINT8_FI_PROD_base_lat                         1       1k      1000        0.00s      0.58       1.72       0.58
FI_INT16_FI_PROD_base_lat                         2       1k      1.9k        0.00s      1.18       1.70       0.59
FI_UINT16_FI_PROD_base_lat                        2       1k      1.9k        0.00s      1.17       1.72       0.58
Provider doesn't support FI_PROD base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_PROD base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_PROD base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_PROD base atomic operation on FI_INT128
Provider doesn't support FI_PROD base atomic operation on FI_UINT128
Provider doesn't support FI_LOR base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_LOR base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_LOR base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_LOR base atomic operation on FI_INT128
Provider doesn't support FI_LOR base atomic operation on FI_UINT128
Provider doesn't support FI_LAND base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_LAND base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_LAND base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_LAND base atomic operation on FI_INT128
Provider doesn't support FI_LAND base atomic operation on FI_UINT128
FI_INT32_FI_PROD_base_lat                         4       1k      3.9k        0.00s      2.33       1.71       0.58
FI_UINT32_FI_PROD_base_lat                        4       1k      3.9k        0.00s      2.31       1.73       0.58
FI_INT64_FI_PROD_base_lat                         8       1k      7.8k        0.00s      4.72       1.70       0.59
FI_UINT64_FI_PROD_base_lat                        8       1k      7.8k        0.00s      4.66       1.72       0.58
FI_FLOAT_FI_PROD_base_lat                         4       1k      3.9k        0.00s      2.31       1.73       0.58
FI_DOUBLE_FI_PROD_base_lat                        8       1k      7.8k        0.00s      4.64       1.73       0.58
FI_FLOAT_COMPLEX_FI_PROD_base_lat                 8       1k      7.8k        0.00s      4.85       1.65       0.61
FI_INT8_FI_LOR_base_lat                           1       1k      1000        0.00s      0.60       1.65       0.60
FI_UINT8_FI_LOR_base_lat                          1       1k      1000        0.00s      0.60       1.66       0.60
FI_INT16_FI_LOR_base_lat                          2       1k      1.9k        0.00s      1.18       1.70       0.59
FI_UINT16_FI_LOR_base_lat                         2       1k      1.9k        0.00s      1.20       1.67       0.60
FI_INT32_FI_LOR_base_lat                          4       1k      3.9k        0.00s      2.39       1.67       0.60
FI_UINT32_FI_LOR_base_lat                         4       1k      3.9k        0.00s      2.40       1.67       0.60
FI_INT64_FI_LOR_base_lat                          8       1k      7.8k        0.00s      4.77       1.68       0.60
FI_UINT64_FI_LOR_base_lat                         8       1k      7.8k        0.00s      4.62       1.73       0.58
FI_FLOAT_FI_LOR_base_lat                          4       1k      3.9k        0.00s      2.34       1.71       0.58
FI_DOUBLE_FI_LOR_base_lat                         8       1k      7.8k        0.00s      4.69       1.70       0.59
FI_FLOAT_COMPLEX_FI_LOR_base_lat                  8       1k      7.8k        0.00s      5.11       1.57       0.64
FI_INT8_FI_LAND_base_lat                          1       1k      1000        0.00s      0.58       1.71       0.58
FI_UINT8_FI_LAND_base_lat                         1       1k      1000        0.00s      0.58       1.71       0.58
FI_INT16_FI_LAND_base_lat                         2       1k      1.9k        0.00s      1.20       1.67       0.60
FI_UINT16_FI_LAND_base_lat                        2       1k      1.9k        0.00s      1.20       1.67       0.60
FI_INT32_FI_LAND_base_lat                         4       1k      3.9k        0.00s      2.39       1.67       0.60
FI_UINT32_FI_LAND_base_lat                        4       1k      3.9k        0.00s      2.45       1.63       0.61
FI_INT64_FI_LAND_base_lat                         8       1k      7.8k        0.00s      4.79       1.67       0.60
FI_UINT64_FI_LAND_base_lat                        8       1k      7.8k        0.00s      4.68       1.71       0.58
FI_FLOAT_FI_LAND_base_lat                         4       1k      3.9k        0.00s      2.29       1.75       0.57
FI_DOUBLE_FI_LAND_base_lat                        8       1k      7.8k        0.00s      4.72       1.69       0.59
FI_FLOAT_COMPLEX_FI_LAND_base_lat                 8       1k      7.8k        0.00s      5.24       1.53       0.65
FI_INT8_FI_BOR_base_lat                           1       1k      1000        0.00s      0.59       1.70       0.59
FI_UINT8_FI_BOR_base_lat                          1       1k      1000        0.00s      0.59       1.70       0.59
FI_INT16_FI_BOR_base_lat                          2       1k      1.9k        0.00s      1.17       1.71       0.58
FI_UINT16_FI_BOR_base_lat                         2       1k      1.9k        0.00s      1.16       1.72       0.58
FI_INT32_FI_BOR_base_lat                          4       1k      3.9k        0.00s      2.34       1.71       0.59
FI_UINT32_FI_BOR_base_lat                         4       1k      3.Provider doesn't support FI_BOR base atomic operation on FI_FLOAT
Provider doesn't support FI_BOR base atomic operation on FI_DOUBLE
Provider doesn't support FI_BOR base atomic operation on FI_FLOAT_COMPLEX
Provider doesn't support FI_BOR base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_BOR base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_BOR base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_BOR base atomic operation on FI_INT128
Provider doesn't support FI_BOR base atomic operation on FI_UINT128
Provider doesn't support FI_BAND base atomic operation on FI_FLOAT
Provider doesn't support FI_BAND base atomic operation on FI_DOUBLE
Provider doesn't support FI_BAND base atomic operation on FI_FLOAT_COMPLEX
Provider doesn't support FI_BAND base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_BAND base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_BAND base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_BAND base atomic operation on FI_INT128
Provider doesn't support FI_BAND base atomic operation on FI_UINT128
Provider doesn't support FI_LXOR base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_LXOR base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_LXOR base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_LXOR base atomic operation on FI_INT128
Provider doesn't support FI_LXOR base atomic operation on FI_UINT128
Provider doesn't support FI_BXOR base atomic operation on FI_FLOAT
Provider doesn't support FI_BXOR base atomic operation on FI_DOUBLE
Provider doesn't support FI_BXOR base atomic operation on FI_FLOAT_COMPLEX
Provider doesn't support FI_BXOR base atomic operation on FI_DOUBLE_COMPLEX
Provider doesn't support FI_BXOR base atomic operation on FI_LONG_DOUBLE
Provider doesn't support FI_BXOR base atomic operation on FI_LONG_DOUBLE_COMPLEX
Provider doesn't support FI_BXOR base atomic operation on FI_INT128
Provider doesn't support FI_BXOR base atomic operation on FI_UINT128
timeout: the monitored command dumped core

server returncode: 255

@shijin-aws
Contributor

shijin-aws commented Oct 6, 2023

It also failed the osu_get_acc_latency test with Open MPI on 2 nodes for the EFA provider, which I don't understand right now, as it shouldn't use shm in this case.


INFO     root:utils.py:67 Executing command: export PATH=/opt/amazon/openmpi/bin:$PATH;export LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr9404/install/libfabric/lib;/opt/amazon/openmpi/bin/mpirun --wdir . -n 2 --hostfile /home/ec2-user/PortaFiducia/hostfile --map-by ppr:1:node --timeout 1800 -x MPIEXEC_TIMEOUT=1800 -x LD_LIBRARY_PATH=/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr9404/install/libfabric/lib -x PATH  /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.5/source/osu-micro-benchmarks-7.0-lrbison3/c/mpi/one-sided/osu_get_acc_latency   2>&1 | tee /home/ec2-user/PortaFiducia/build/workloads/omb/openmpi-v4.1.5/run/one-sided/osu_get_acc_latency/node2-ppn1.txt
INFO     root:utils.py:466 mpirun output:
# OSU MPI_Get_accumulate latency Test v7.0-lrbison3
# Window creation: MPI_Win_create
# Synchronization: MPI_Win_lock/unlock
# Size          Latency (us)
[ip-172-31-45-84:836217] *** Process received signal ***
[ip-172-31-45-84:836217] Signal: Segmentation fault (11)
[ip-172-31-45-84:836217] Signal code: Address not mapped (1)
[ip-172-31-45-84:836217] Failing at address: 0x3200000049
[ip-172-31-45-84:836217] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7fc5d05c9cf0]
[ip-172-31-45-84:836217] [ 1] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr9404/install/libfabric/lib/libfabric.so.1(+0x1f32c)[0x7fc5c34ce32c]
[ip-172-31-45-84:836217] [ 2] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr9404/install/libfabric/lib/libfabric.so.1(+0x1f543)[0x7fc5c34ce543]
[ip-172-31-45-84:836217] [ 3] /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr9404/install/libfabric/lib/libfabric.so.1(+0xb04a3)[0x7fc5c355f4a3]
[ip-172-31-45-84:836217] [ 4] /home/ec2-user/PortaFiducia/build/

ZE v2 hardware no longer a priority for testing, replaced by v3

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Adding FI_HMEM support to atomic calls requires the correct shm descriptor for the
result and compare iovs. Efa was passing in the efa descriptors and not the shm ones.

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
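The descriptor mix-up described in the commit message above can be pictured with a small sketch. It assumes an outer provider whose descriptor points at a registration object that also remembers the descriptor expected by shm. For a compare-atomic, the operand, compare, and result lists each need their own translation. All type and function names here are hypothetical and do not mirror the actual efa code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IOV 2   /* hypothetical iov limit for this sketch */

/* Hypothetical registration object of the outer provider. */
struct outer_mr {
    void *shm_desc;   /* descriptor the shm provider expects */
};

/* Descriptors for the three buffer lists of a compare-atomic. */
struct atomic_descs {
    void *desc[MAX_IOV];           /* operand buffers */
    void *compare_desc[MAX_IOV];   /* compare buffers */
    void *result_desc[MAX_IOV];    /* result buffers  */
    size_t count;
};

/* Map each outer descriptor to its shm descriptor (NULL if unregistered). */
static void translate(size_t count, void *const *in, void **out)
{
    for (size_t i = 0; i < count; i++) {
        const struct outer_mr *mr = in[i];
        out[i] = mr ? mr->shm_desc : NULL;
    }
}

/* Translate all three lists, not only the operand list. */
static void build_shm_descs(const struct atomic_descs *in,
                            struct atomic_descs *out)
{
    assert(in->count <= MAX_IOV);
    out->count = in->count;
    translate(in->count, in->desc, out->desc);
    translate(in->count, in->compare_desc, out->compare_desc);
    translate(in->count, in->result_desc, out->result_desc);
}
```

Forwarding `in->result_desc` or `in->compare_desc` unchanged would hand shm a pointer to the outer registration object instead of a shm descriptor, which is the kind of error the commit message describes.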
@aingerson changed the title from "prov/shm, fabtests/hmem, contrib/intel: enable full fabtests FI_HMEM support and enable in Intel CI" to "prov/shm, prov/efa, fabtests/hmem, contrib/intel: enable full fabtests FI_HMEM support and enable in Intel CI" on Oct 12, 2023
@aingerson
Contributor Author

@shijin-aws @a-szegel Updated PR with fix in efa for passing in the correct shm descriptor into the atomic calls. Let me know if you see any issues with it!

@aingerson
Contributor Author

@a-szegel @shijin-aws The AWS CI was still failing with a similar MR descriptor issue so I added another fix and now it's passing but it's weird that everything else was passing without that and only the atomic path seems to be problematic... Let me know what you think or if you want to rerun/do more testing to investigate the impact before merging.

@shijin-aws
Contributor

shijin-aws commented Oct 13, 2023

@aingerson Your fix to efa provider looks reasonable to me, thanks.

it's weird that everything else was passing without that and only the atomic path seems to be problematic...

Are you talking about this fix 61140ec?

I think that function assumes shm_desc is only updated when efa_mr is present. We already set shm_desc to NULL before calling this function in efa_rdm_msg and efa_rdm_rma. Before your change, efa_rdm_atomic_init_shm_msg implemented this initialization. I think your fix is good for the general usage of this function, thanks!

@aingerson
Contributor Author

@shijin-aws Thank you for the explanation! I missed the NULL initialization in the msg and RMA paths but that makes sense now. Would you like me to remove the other initialization so that you don't have to initialize twice? Or just leave it?

@shijin-aws
Contributor

I think removing other initialization is better, thank you

@aingerson
Contributor Author

@shijin-aws Updated to remove the NULL initialization in the other places. It was missing in the readmsg and writemsg paths so could be good to backport the change to make sure. I'll leave that up to you though.

@shijin-aws
Contributor

I will backport your fix commits for efa to v1.19.x, thanks

@aingerson
Contributor Author

@shijin-aws @a-szegel More failures happened when I removed the NULL initialization, and it was because there were a lot of checks that initialized the shm_desc conditionally, so I had to remove them all. I also did a little cleanup of some initialization that seemed unnecessary, to clean up the shm send path. Let me know if I overstepped and you want me to undo those.

@shijin-aws
Contributor

shijin-aws commented Oct 14, 2023

@aingerson How about we just initialize shm_res_desc and shm_comp_desc as NULL, and keep efa_rdm_get_desc_for_shm (and other efa code) unchanged? Then it should also pass the CI, right? I would prefer to keep efa_rdm_get_desc_for_shm with fewer instructions, as it's in a fast path. WDYT? We will follow up and refactor them carefully. Sorry for your extra work!
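The trade-off in this suggestion can be sketched as follows: the fast-path helper only writes slots that have a registration, so the caller clears the output array once before calling it. This is a simplified illustration under that assumption; the names `get_desc_for_shm` and `prepare_shm_desc` and the `outer_mr` type are hypothetical stand-ins, not the efa implementation.

```c
#include <assert.h>
#include <stddef.h>

#define IOV_LIMIT 4   /* hypothetical iov limit for this sketch */

/* Hypothetical registration object holding the shm descriptor. */
struct outer_mr {
    void *shm_desc;
};

/* Fast-path helper: writes a slot only when a registration exists,
 * so unregistered slots keep whatever value they had before. */
static void get_desc_for_shm(size_t count, void **desc, void **shm_desc)
{
    for (size_t i = 0; i < count; i++) {
        struct outer_mr *mr = desc[i];
        if (mr)
            shm_desc[i] = mr->shm_desc;
    }
}

/* Caller side: clear the output once, then let the helper fill it in. */
static void prepare_shm_desc(size_t count, void **desc, void **shm_desc)
{
    assert(count <= IOV_LIMIT);
    for (size_t i = 0; i < count; i++)
        shm_desc[i] = NULL;
    if (desc)
        get_desc_for_shm(count, desc, shm_desc);
}
```

With this split the helper stays branch-light, while every caller that skips the clearing step risks passing stale stack contents as a descriptor.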

shm now has atomic support for FI_HMEM and can be enabled

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
Add support for FI_HMEM and atomics by passing the desc/ofi_mr to copy atomic
data to and from the shm region. On the target side, use a temporary host buffer
for the atomic operation and then copy the result into the destination using
the hmem functions

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
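The target-side handling described in the commit message above (stage in a temporary host buffer, apply the operation, copy the result to the destination) can be sketched as below. The two copy functions are stand-ins that use `memcpy`, so "device" memory here is ordinary host memory; real code would call the HMEM copy routine for the buffer's interface. The function names and `MAX_ELEMS` are hypothetical, and the sketch omits the locking a real atomic needs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ELEMS 16   /* hypothetical size of the temporary host buffer */

/* Stand-ins for device copies; memcpy is used so the sketch runs anywhere. */
static void copy_from_dev(void *host, const void *dev, size_t len)
{
    memcpy(host, dev, len);
}

static void copy_to_dev(void *dev, const void *host, size_t len)
{
    memcpy(dev, host, len);
}

/* Apply an int32 SUM to a destination that may live in device memory.
 * Returns 0 on success, -1 if the request exceeds the host buffer. */
static int atomic_sum_i32(int32_t *dst_dev, const int32_t *src, size_t count)
{
    int32_t tmp[MAX_ELEMS];
    size_t len = count * sizeof(*tmp);

    if (count > MAX_ELEMS)
        return -1;

    copy_from_dev(tmp, dst_dev, len);    /* 1. stage current values on the host */
    for (size_t i = 0; i < count; i++)
        tmp[i] += src[i];                /* 2. run the operation on host memory */
    copy_to_dev(dst_dev, tmp, len);      /* 3. write the result to the destination */
    return 0;
}
```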
Use the dev_host_buf for filling and checking the sent/received greeting
in order to use device memory for the actual transfer.

This also moves the received data print into the check greeting function and
removes it from all places it is called from.

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
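A rough sketch of the greeting flow from the commit message above: the message is built and verified in a host-side bounce buffer, while the transfer buffers themselves may be device memory. The greeting text, the function names, and the `memcpy`-based copy stand-ins are assumptions for illustration; they are not the fabtests helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical greeting; the terminating NUL is sent as part of it. */
static const char greeting[] = "Hello from Client!";

/* Stand-ins for host/device copies; real code would use HMEM copies. */
static void to_dev(void *dev, const void *host, size_t len)
{
    memcpy(dev, host, len);
}

static void from_dev(void *host, const void *dev, size_t len)
{
    memcpy(host, dev, len);
}

/* Sender: build the greeting on the host, then move it into the
 * transfer buffer. Returns the number of bytes to send, 0 if too small. */
static size_t fill_greeting(void *tx_buf, char *host_buf, size_t size)
{
    size_t len = sizeof(greeting);

    if (len > size)
        return 0;
    memcpy(host_buf, greeting, len);
    to_dev(tx_buf, host_buf, len);
    return len;
}

/* Receiver: copy the received bytes to the host before comparing.
 * Returns 0 on a match, -1 otherwise. */
static int check_greeting(const void *rx_buf, char *host_buf, size_t len)
{
    if (len != sizeof(greeting))
        return -1;
    from_dev(host_buf, rx_buf, len);
    return memcmp(host_buf, greeting, len) ? -1 : 0;
}
```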
Use dev_host_buf to copy data into and out of transfer buffers to support
FI_HMEM

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
The uint64_t device was updated to include the device and driver
indices. The index is isolated using the ze_get_device_idx function
which masks the driver bits. The cmd_queue used in the copy function
was using the full device rather than just the dev_id

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
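The device id bug in the commit message above can be illustrated with a sketch of a packed handle. The bit layout used here (driver index in the upper 32 bits, device index in the lower 32) is an assumption chosen for the example, as are all names; the source only says that the driver bits are masked off to obtain the device index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: driver index in the high half, device index in the low half. */
#define DRIVER_SHIFT 32
#define DEVICE_MASK  ((UINT64_C(1) << DRIVER_SHIFT) - 1)
#define NUM_DEVICES  4

/* One command queue slot per device (placeholder type). */
static int cmd_queue[NUM_DEVICES];

static uint64_t pack_device(uint32_t driver_idx, uint32_t dev_idx)
{
    return ((uint64_t) driver_idx << DRIVER_SHIFT) | dev_idx;
}

/* Mask off the driver bits to isolate the device index. */
static uint32_t device_idx(uint64_t device)
{
    return (uint32_t) (device & DEVICE_MASK);
}

static uint32_t driver_idx(uint64_t device)
{
    return (uint32_t) (device >> DRIVER_SHIFT);
}

/* Look up the queue with the device index, never with the full handle. */
static int *queue_for(uint64_t device)
{
    uint32_t idx = device_idx(device);

    return idx < NUM_DEVICES ? &cmd_queue[idx] : NULL;
}
```

Indexing `cmd_queue` with the full packed value would go out of bounds as soon as the driver index is nonzero, which matches the kind of failure the commit fixes.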
Since not all of the fabtests supported device memory, the CI used
a wrapper script to run a subset of tests that were supported.
Now that all of the runfabtests support FI_HMEM properly, this simplifies
the ZE testing by going through the regular run_fabtests path.
The --device parameter is removed and replaced with the --way parameter
which indicates which direction to test (h2d, d2d, xd2d, default None).
This simplifies the code path and enables the ZE testing to run the
full testsuite, rather than the subset, increasing our coverage.

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
@a-szegel
Contributor

The AWS failure is not related to this change (TCP).

@a-szegel
Contributor

bot:aws:retest

@zachdworkin zachdworkin merged commit fe66cc4 into ofiwg:main Oct 16, 2023
33 checks passed