
DAOS-16445 client: Add function to cycle OIDs non-sequentially #14999

Merged (4 commits, Aug 30, 2024)

Conversation

@jolivier23 (Contributor) commented Aug 24, 2024

We've noticed that with sequential order, object placement is poor.

We get 40% fill for 8GiB files with 25 ranks and 16 targets per rank
with EC_2P1G8. With this patch, we get a much better distribution.

This patch adds the following:

  1. A function that cycles oid.hi by incrementing it by a large prime.
  2. For DFS, randomization of the per-mount starting value.
  3. Changes to DFS to cycle OIDs using the new function.

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.


Ticket title is 'Experiment with different oid generation algorithm'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-16445

@daosbuild1 (Collaborator)

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14999/2/testReport/

@jolivier23 force-pushed the jvolivie/use_oids branch 4 times, most recently from 2cb0bc7 to c10ebb8, on August 24, 2024 05:59
@jolivier23 changed the title from "DAOS-16445 client: Try cycling different method" to "DAOS-16445 client: New oid allocation order" on Aug 24, 2024
@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14999/6/execution/node/1463/log

@jolivier23 marked this pull request as ready for review August 28, 2024 17:55
@jolivier23 requested review from a team as code owners August 28, 2024 17:55
@jolivier23 (Contributor, Author)

@mchaarawi In our simulations, this does quite a bit better.

@mchaarawi (Contributor)

I am uncertain as to why this resolves things. It sounds to me like this issue needs to be fixed at a lower level, not in the POSIX or other middleware layers.

@jolivier23 (Contributor, Author)

> I am uncertain as to why this resolves things. it sounds to me like this issue needs to be fixed at a lower level and not in the POSIX or other middleware layers.

I agree to some extent. It sounds like the placement algorithm doesn't work very well with mostly sequential oid allocation (i.e., placement is not even close to balanced). I'm not sure how to fix the allocation order anywhere other than in the client logic that cycles through the oids. Fixing the placement hash so that it works better with sequentially allocated oids would probably be best, but that is also a layout change.

@jolivier23 (Contributor, Author)

> I am uncertain as to why this resolves things. it sounds to me like this issue needs to be fixed at a lower level and not in the POSIX or other middleware layers.
>
> I agree to some extent. It sounds like the placement algorithm doesn't work very well with mostly sequential oid allocation (e.g. not even close to balanced placement). I'm not sure how to fix the allocation order in any other way than in the client logic that cycles through them. If we want to fix placement hash such that it works better with sequentially allocated oids, that is probably best but also a layout change.

It may be good to make this more configurable so we can play with settings.

I suppose one other thing we could do is something similar with oid.lo, but I suspect that may still have the same issue where sequential allocation doesn't place well.

@mchaarawi (Contributor)

> I am uncertain as to why this resolves things. it sounds to me like this issue needs to be fixed at a lower level and not in the POSIX or other middleware layers.
>
> I agree to some extent. It sounds like the placement algorithm doesn't work very well with mostly sequential oid allocation (e.g. not even close to balanced placement). I'm not sure how to fix the allocation order in any other way than in the client logic that cycles through them. If we want to fix placement hash such that it works better with sequentially allocated oids, that is probably best but also a layout change.
>
> It may be good to have this be more configurable so we can play with settings.
>
> I suppose one other thing we could do is to do something similar with oid.lo but I suspect that may still have this issue where sequential allocation doesn't place well.

If you touch oid.lo then you are introducing a DFS layout change and will have to add compatibility code.
I'm more worried about non-DFS use cases. It sounds like this is a more generic problem that needs a better solution, IMO.

@jolivier23 (Contributor, Author)

> I am uncertain as to why this resolves things. it sounds to me like this issue needs to be fixed at a lower level and not in the POSIX or other middleware layers.
>
> I agree to some extent. It sounds like the placement algorithm doesn't work very well with mostly sequential oid allocation (e.g. not even close to balanced placement). I'm not sure how to fix the allocation order in any other way than in the client logic that cycles through them. If we want to fix placement hash such that it works better with sequentially allocated oids, that is probably best but also a layout change.
>
> It may be good to have this be more configurable so we can play with settings.
> I suppose one other thing we could do is to do something similar with oid.lo but I suspect that may still have this issue where sequential allocation doesn't place well.
>
> if you touch oid.lo then you are introducing a dfs layout change and will have to add compatibility code. im more worried about non DFS use cases.. it sounds like this is a more generic problem that needs a better solution IMO.

The only option I can think of is making a generic version of oid_gen, something like

daos_cont_oid_gen(coh, &oid);

which basically does what oid_gen does. I can take a crack at it. We may need to cherry-pick this patch in the near term, though.

And maybe something like
daos_cont_oid_init(coh, num_reserved) to reserve the first few oids in oid.lo == 0.

@jolivier23 requested review from a team as code owners August 29, 2024 19:31
We've noticed that with 8GB files we fill space pretty quickly.

Just trying another method to see if it works any better.

The basic idea is twofold:

1. Use a large prime number as the increment as we cycle through the
   oids.
2. Randomize which one we start with on a per mount basis.

Required-githooks: true

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Required-githooks: true

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
@jolivier23 (Contributor, Author)

We get 80% utilization before ENOSPC with 64GiB files with this patch. Previously, 8GiB files were hitting ENOSPC at around 40% utilization. We are testing 8GiB files next but this looks promising.

@mchaarawi (Contributor) left a comment

It's still surprising to me that sequential allocation does such a bad job, but this is OK with me and does not introduce layout compatibility issues with existing containers.

@jolivier23 (Contributor, Author)

> it's still surprising to me why the sequential allocation does such a bad job, but this is OK for me and does not introduce layout compatibility issues with existing containers.

Agreed. We should probably open a ticket on this at some point, though it's maybe not that important now that there is a workaround. I suppose we should call out somewhere in the docs that this matters. @Michael-Hennecke do you have a place where that would fit?

@jolivier23 changed the title from "DAOS-16445 client: New oid allocation order" to "DAOS-16445 client: Add function to cycle OIDs non-sequentially" on Aug 30, 2024
@karthjyojay self-requested a review August 30, 2024 21:35
@@ -564,6 +564,21 @@ daos_obj_generate_oid(daos_handle_t coh, daos_obj_id_t *oid,
enum daos_otype_t type, daos_oclass_id_t cid,
daos_oclass_hints_t hints, uint32_t args);


/**
* This function, if called 2^32 times, will set oid->hi to every unique 32-bit
Contributor
I'm confused by the first sentence of this comment. Are oids 32-bit integers, which means there are 2^32 possible combinations? Is this function just getting us a unique hi every time it is called? When does it get called?

Contributor Author
The oid consists of two 64-bit integers, lo and hi. The lower 96 bits (all of lo and the low 32 bits of hi) are supposed to be unique within a container for a given object. DAOS provides a method that, on request, allocates a lo that is globally unique within the container. The user, in this case DFS, then has 2^32 unique values it can allocate to produce a globally unique oid within the container. When it runs out, it will request a new lo (it probably won't run out, as that would require the process to create 2^32 + 1 objects).

This function is just a helper to cycle through the range of available oids.

Contributor Author
The upper 32 bits of hi are used by DAOS to encode things like the object class.

@jolivier23 merged commit d2f062a into master Aug 30, 2024
58 checks passed
@jolivier23 deleted the jvolivie/use_oids branch August 30, 2024 21:55
jolivier23 added a commit that referenced this pull request Aug 30, 2024
Change-Id: I4b35ba1b3e2285f760d3be8db9e5bc724f24d772
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
jolivier23 added a commit that referenced this pull request Aug 31, 2024
… (#15052)

jolivier23 added a commit that referenced this pull request Sep 10, 2024
mchaarawi added a commit that referenced this pull request Sep 11, 2024
* DAOS-16484 test: Exclude local host in default interface selection (#15049)

When including the local host in the default interface selection a
difference in ib0 speeds will cause the logic to select eth0 and then
the tcp provider.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

* DAOS-15800 client: create cart context on specific interface (#14804)

Cart has added the ability to select network interface on context creation. The daos_agent also added a numa-fabric map that can be queried at init time. Update the DAOS client to query from the agent a map of numa to network interface on daos_init(), and on EQ creation, select the best interface for the network context based on the numa of the calling thread.

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>

* DAOS-16445 client: Add function to cycle OIDs non-sequentially (#14999)

We've noticed that with sequential order, object placement is poor.

We get 40% fill for 8GiB files with 25 ranks and 16 targets per rank
with EC_2P1G8. With this patch, we get a much better distribution.

This patch adds the following:

1. A function for cycling oid.hi incrementing by a large prime
2. For DFS, randomize the starting value
3. Modify DFS to cycle OIDs using the new function.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>

* DAOS-16251 dtx: Fix dtx_req_send use-after-free (#15035)

In dtx_req_send, since crt_req_send releases the req reference, din may
have been freed by the time it is dereferenced for the DL_CDEBUG call.

Signed-off-by: Li Wei <wei.g.li@intel.com>

* DAOS-16304 tools: Add daos health net-test command (#14980)

Wrap self_test to provide a simplified network test
to detect obvious client/server connectivity and
performance problems.

Signed-off-by: Michael MacDonald <mjmac@google.com>

* DAOS-16272 dfs: fix get_info returning incorrect oclass (#15048)

If user creates a container without --file-oclass, the get_info call was
returning the default oclass of a directory on daos fs get-attr. Fix
that to properly use the enum types for default scenario.

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>

* DAOS-15863 container: fix a race for container cache (#15038)

* DAOS-15863 container: fix a race for container cache

while destroying a container, cont_child_destroy_one() releases
its own refcount before waiting, if another ULT releases its
refcount, which is the last one, wakes up the waiting ULT and frees
it ds_cont_child straightaway, because no one else has refcount.

When the waiting ULT is waken up, it will try to change the already
freed ds_cont_child.

This patch changes the LRU eviction logic and fixes this race.


Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Co-authored-by: Jeff Olivier <jeffolivier@google.com>

* DAOS-16471 test: Reduce targets for ioctl_pool_handles.py (#15063)

The dfuse/ioctl_pool_handles.py test is overloading the VM so reduce the number of engine targets.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

* DAOS-16483 vos: handle empty DTX when vos_tx_end (#15053)

It is possible that the DTX modified nothing when stopping the current
backend transaction. In such a case we may not generate a persistent DTX
entry, so we need to bypass that case before checking the on-disk DTX
entry status.

The patch also does some cleanup and removes redundant metrics for
committed DTX entries.

Enhance vos_dtx_deregister_record() to handle GC case.

Signed-off-by: Fan Yong <fan.yong@intel.com>

* DAOS-16271 mercury: Add patch to avoid seg fault in key resolve. (#15067)

Signed-off-by: Joseph Moore <joseph.moore@intel.com>

* DAOS-16484 test: Support mixed speeds when selecting a default interface (#15050)

Allow selecting a default interface that is running at a different speed
on different hosts.  Primarily this is to support selecting the ib0
interface by default when the launch node has a slower ib0 interface
than the cluster hosts.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

* DAOS-16446 test: HDF5-VOL test - Set object class and container prope… (#15004)

In HDF5, DFS, MPIIO, or POSIX, object class and container properties are defined
during the container create. If it’s DFS, object class is also set to the IOR
parameter. However, in HDF5-VOL, object class and container properties are
defined with the following environment variables of mpirun.

HDF5_DAOS_OBJ_CLASS (Object class)
HDF5_DAOS_FILE_PROP (Container properties)

The infrastructure to set these variables is already there in run_ior_with_pool().
In file_count_test_base.py, pass the env vars to run_ior_with_pool(env=env) as a
dictionary. Object class is the oclass variable. Container properties can be
obtained from the container -> properties field in the test yaml.

This fix is discussed in PR #14964.

Signed-off-by: Makito Kano <makito.kano@intel.com>

* DAOS-16447 test: set D_IL_REPORT per test (#15012)

set D_IL_REPORT per test instead of setting defaults values in
utilities. This allows running without it set.

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

* DAOS-16450 test: auto run dfs tests when dfs is modified (#15017)

Automatically include dfs tests when dfs files are modified in PRs.

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

* DAOS-16510 cq: update pylint to 3.2.7 (#15072)

update pylint to 3.2.7

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

* DAOS-16509 test: replace IorTestBase.execute_cmd with run_remote (#15070)

replace usage of IorTestBase.execute_cmd with run_remote

Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

* DAOS-16458 object: fix invalid DRAM access in obj_bulk_transfer (#15026)

For an EC object update via CPD RPC, when calculating the bitmap to skip
some iods for the current EC data shard, we may pass NULL for the "*skips"
parameter. The old logic in obj_get_iods_offs_by_oid() could then leave
the "skips" bitmap pointing at undefined memory. Such a bitmap may be
overwritten by others, so a subsequent obj_bulk_transfer() may be
misguided.

The patch also fixes a bug inside obj_bulk_transfer() that forcibly cast
any input RPC as UPDATE/FETCH.

Signed-off-by: Fan Yong <fan.yong@intel.com>

* DAOS-16486 object: return proper error on stale pool map (#15064)

A client with a stale pool map may try to send an RPC to a DOWN target. If
the target was brought DOWN due to a faulty NVMe device, the ds_pool_child
could have been stopped by the NVMe faulty reaction; we should ensure a
proper error code is returned in that case.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>

* DAOS-16514 vos: fix coverity issue (#15083)

Fix coverity 2555843: explicit null dereference.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>

* DAOS-16467 rebuild: add DAOS_POOL_RF ENV for massive failure case (#15037)

* DAOS-16467 rebuild: add DAOS_PW_RF ENV for massive failure case

Allow the user to set DAOS_PW_RF as pw_rf (pool-wise RF).
If a SWIM-detected engine failure would break pw_rf, don't change the
pool map and don't trigger rebuild; instead, log a critical message asking
the administrator to bring back those engines with top priority (just
"system start --ranks=xxx"; there is no need to reintegrate them).

A few functions were renamed to avoid confusion:
pool_map_find_nodes() -> pool_map_find_ranks()
pool_map_find_node_by_rank() -> pool_map_find_dom_by_rank()
pool_map_node_nr() -> pool_map_rank_nr()

Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>

* DAOS-16508 csum: retry a few times on checksum mismatch on update (#15069)

Unlike fetch, we return DER_CSUM on update (turned into EIO by dfs) without any retry.
We should retry a few times in case it is a transient error.

The patch also prints more information about the actual checksum mismatch.

Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>

* DAOS-10877 vos: gang allocation for huge SV (#14790)

To avoid allocation failure on a fragmented system, huge SV allocation will
be split into multiple smaller allocations, each allocation size is capped
to 8MB (the DMA chunk size, that could avoid huge DMA buffer allocation).

The address of such scattered SV payload is represented by 'gang address'.

Removed io_allocbuf_failure() vos unit test, it's not applicable in gang
SV mode now.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>

* DAOS-16304 tools: Adjust default RPC size for net-test (#15091)

The previous default of 1MiB isn't helpful at large scales.
Use a default of 1KiB to get faster results and a better
balance between raw latency and bandwidth.

Also include calculated rpc throughput and bandwidth in
JSON output.

Signed-off-by: Michael MacDonald <mjmac@google.com>

* SRE-2408 ci: Increase timeout (to 15 minutes) for system restore (#14926)

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

* DAOS-16251 object: Fix obj_ec_singv_split overflow (#15045)

It has been seen that obj_ec_singv_split may read beyond the end of
sgl->sg_iovs[0].iov_buf:

    iod_size=8569
    c_bytes=4288
    id_shard=0
    tgt_off=1
    iov_len=8569
    iov_buf_len=8569

The memmove read 4288 bytes from offset 4288, whereas the buffer only
had 8569 - 4288 = 4281 bytes from offset 4288. This patch fixes the
problem by adding the min(...) expression.

Signed-off-by: Li Wei <wei.g.li@intel.com>

---------

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Signed-off-by: Li Wei <wei.g.li@intel.com>
Signed-off-by: Michael MacDonald <mjmac@google.com>
Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Signed-off-by: Fan Yong <fan.yong@intel.com>
Signed-off-by: Joseph Moore <joseph.moore@intel.com>
Signed-off-by: Makito Kano <makito.kano@intel.com>
Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Jeff Olivier <jeffolivier@google.com>
Co-authored-by: Li Wei <wei.g.li@intel.com>
Co-authored-by: Michael MacDonald <mjmac@google.com>
Co-authored-by: Liang Zhen <liang.zhen@intel.com>
Co-authored-by: Nasf-Fan <fan.yong@intel.com>
Co-authored-by: Joseph Moore <26410038+jgmoore-or@users.noreply.github.com>
Co-authored-by: Makito Kano <makito.kano@intel.com>
Co-authored-by: Dalton Bohning <dalton.bohning@intel.com>
Co-authored-by: Niu Yawei <yawei.niu@intel.com>
Co-authored-by: Liu Xuezhao <xuezhao.liu@intel.com>
Co-authored-by: Johann Lombardi <johann.lombardi@gmail.com>
Co-authored-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
mchaarawi pushed a commit that referenced this pull request Oct 4, 2024
… (#15117)
