use _by_dnode() routines for additional performance wins #4802

ahrens · 2016-06-26T16:16:39Z

#4641 recommends adding *_by_dnode() routines for accessing objects given their dnode_t*, which is more efficient than accessing the object by (objset_t*, uint64_t object). We should convert additional performance-sensitive code paths to use these routines as we discover them.

For the concurrent file creation microbenchmark, the following additional code paths spend a lot of time in dnode_hold(), and could benefit from conversion:

dmu_tx_hold_object_impl() via dmu_tx_hold_sa()
dmu_tx_hold_object_impl() via dmu_tx_hold_sa_create()
dmu_tx_hold_zap(dzp->z_id) via zfs_create()
zap_add(dzp->z_id) via zfs_link_create()
dmu_tx_add_new_object() via dmu_object_alloc()
zap_lookup[_norm](dzp->z_id) via zfs_match_find()

It may be inconvenient to keep track of the dnode_t to supply in some of these code paths (or others). In that case we should consider implementing a "dnode cache": a global hashtable which maps from <objset_t*, objectID> to <dnode_t*>. This would avoid contention on the dn_mtx for dnodes that have been recently accessed. In terms of cost/reward, the cost of fixing this is medium, and the benefit is medium.

Analysis sponsored by Intel Corp.

The text was updated successfully, but these errors were encountered:

Add *_by_dnode() routines for accessing objects given their dnode_t *, this is more efficient than accessing the object by (objset_t *, uint64_t object). This change converts some but not all of the existing consumers. As performance-sensitive code paths are discovered they should be converted to use these routines. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com> Closes #5534 Issue #4802

wangdbang · 2017-02-15T10:26:16Z

@ahrens and @behlendorf , Could we use "dmu_tx_hold_write_by_dnode" to replace "dmu_tx_hold_write" in zvol_write?

behlendorf · 2017-02-15T17:21:07Z

@wangdbang we can. We can also make the same change to zvol_replay_write.

wangdbang · 2017-02-16T00:46:45Z

@behlendorf ,thank you for reply, another question, Could the tx(dmu_txt_t) be a member of the zvol? avoid to call the dmu_tx_create per zvol_write. the same way, if we use "dmu_tx_hold_write_by_dnode" to replace "dmu_tx_hold_write", Could the dnode be a member of the zvol? is that possible?

behlendorf · 2017-02-16T01:14:15Z

@wangdbang we should be able to safely add the dnode_t for the DMU_OST_ZVOL object to zvol_state_t as long and we keep a hold on it. A similar thing is already done for the bonus dbuf, zv->zv->dbuf.

The tx is another story. We need to create a new one for every write for the ZIL to function as designed.

By chance have you done any profiling of how expensive the dnode lookup is here?

ahrens · 2017-02-16T01:28:01Z

dmu_tx_hold_write_by_dnode() from zvol should be possible.

dmu_tx_create/assign/commit() for each write is fundamental (not just for the ZIL).

wangdbang · 2017-02-16T01:37:24Z

@behlendorf ,thanks, I did not evaluate the cost for dnode lookup, just a suggestion to add the dnode to zvol_state_t as a member from code view sides to avoid dnode lookup per zol_write. i'm a new user of ZOL, and study the code as a beginner. thanks again.

wangdbang · 2017-02-16T01:40:37Z

@ahrens and @behlendorf , if you have a patch, i would like to verify it.

wangdbang · 2017-02-16T09:08:26Z

@ahrens and @behlendorf , just evaluate zvol_write performance, My rough thought like this, add a dnode point variable zv_dn into zvol_state, assign the value through dnode_hold at first call in zvol_write, and then call dmu_tx_hold_write_by_dnode to replace the dmu_tx_hold_write, is that possible? if it is, i'll try it.

behlendorf · 2017-02-16T17:24:58Z

@wangdbang adding zv_dn into zvol_state_t makes sense but you'll want to take the hold it in zvol_setup_zv() and release it in zvol_shutdown_zv(). You'll then be able to use zv->zv_dn in zvol_write() for the dmu_tx_hold_write_by_dnode() call.

When benchmarking the writes I'd suggest using a small block size since that's the workload we'd expect to benefit most from this optimization. Thanks for looking at this!

wangdbang · 2017-02-17T00:22:37Z

@behlendorf ,thanks for your suggestion, i'll try it, and share the result if have any progress.

wangdbang · 2017-02-17T09:53:46Z

@behlendorf and @ahrens , here is the rough patch base on v0.7.0-rc3

diff -urN zfs-0.7.0.orig/include/sys/dmu.h zfs-0.7.0/include/sys/dmu.h
--- zfs-0.7.0.orig/include/sys/dmu.h    2017-01-21 02:31:52.000000000 +0800
+++ zfs-0.7.0/include/sys/dmu.h 2017-02-17 16:50:09.007384464 +0800
@@ -674,6 +674,8 @@

 dmu_tx_t *dmu_tx_create(objset_t *os);
 void dmu_tx_hold_write(dmu_tx_t *tx, uint64_t object, uint64_t off, int len);
+int dmu_tx_dnode_hold(objset_t *os, uint64_t object, dnode_t **dnp);
+void dmu_tx_dnode_rele(dnode_t *dn);
 void dmu_tx_hold_write_by_dnode(dmu_tx_t *tx, dnode_t *dn, uint64_t off,
     int len);
 void dmu_tx_hold_free(dmu_tx_t *tx, uint64_t object, uint64_t off,
diff -urN zfs-0.7.0.orig/module/zfs/dmu_tx.c zfs-0.7.0/module/zfs/dmu_tx.c
--- zfs-0.7.0.orig/module/zfs/dmu_tx.c  2017-01-21 02:36:28.000000000 +0800
+++ zfs-0.7.0/module/zfs/dmu_tx.c       2017-02-17 16:49:52.934384295 +0800
@@ -453,6 +453,18 @@
        dmu_tx_count_dnode(txh);
 }

+int
+dmu_tx_dnode_hold(objset_t *os, uint64_t object, dnode_t **dnp)
+{
+        return (dnode_hold(os, object, FTAG, dnp));
+}
+
+void
+dmu_tx_dnode_rele(dnode_t *dn)
+{
+        dnode_rele(dn, FTAG);
+}
+
 void
 dmu_tx_hold_write_by_dnode(dmu_tx_t *tx, dnode_t *dn, uint64_t off, int len)
 {
diff -urN zfs-0.7.0.orig/module/zfs/zvol.c zfs-0.7.0/module/zfs/zvol.c
--- zfs-0.7.0.orig/module/zfs/zvol.c    2017-01-21 02:36:28.000000000 +0800
+++ zfs-0.7.0/module/zfs/zvol.c 2017-02-17 18:00:07.759383322 +0800
@@ -89,6 +89,7 @@
        struct hlist_node       zv_hlink;       /* hash link */
        atomic_t                zv_suspend_ref; /* refcount for suspend */
        krwlock_t               zv_suspend_lock;        /* suspend lock */
+       dnode_t                 *zv_dn;
 };

 typedef enum {
@@ -655,7 +656,7 @@
                if (bytes > volsize - off)      /* don't write past the end */
                        bytes = volsize - off;

-               dmu_tx_hold_write(tx, ZVOL_OBJ, off, bytes);
+               dmu_tx_hold_write_by_dnode(tx, zv->zv_dn, off, bytes);

                /* This will only fail for ENOSPC */
                error = dmu_tx_assign(tx, TXG_WAIT);
@@ -980,6 +981,10 @@
        if (error)
                return (SET_ERROR(error));

+        error = dmu_tx_dnode_hold(os, ZVOL_OBJ, &zv->zv_dn);
+        if (error)
+                return (SET_ERROR(error));
+
        set_capacity(zv->zv_disk, volsize >> 9);
        zv->zv_volsize = volsize;
        zv->zv_zilog = zil_open(os, zvol_get_data);
@@ -1008,6 +1013,8 @@
        dmu_buf_rele(zv->zv_dbuf, zv);
        zv->zv_dbuf = NULL;

+       dmu_tx_dnode_rele(zv->zv_dn);
+
        /*
         * Evict cached data
         */

wangdbang · 2017-02-17T10:20:37Z

Here is the 8k test result with orion.

envrionment setup: 8 X INTEL SSDSC2BB80
RAIDZ1 command:zpool create -f POOL00 raidz sdb sdc sdd sde sdf sdg sdh sdi -o ashift=12
ZVOL command:zfs create -b 8k -V 200GB POOL00/VOL00
test command: orion -run advanced -testname sd -num_disks 32 -size_small 8 -type rand -cache_size 0 -matrix row -num_large 0 -write 100
test config: add "/dev/zd0" into sd.lun
v0.7.0-rc3 results:
turn 1st:
Maximum Small IOPS=12709 @ Small=1 and Large=0
Minimum Small Latency=0.08 @ Small=1 and Large=0

turn 2nd:
Maximum Small IOPS=9355 @ Small=3 and Large=0
Minimum Small Latency=0.11 @ Small=1 and Large=0

turn 3rd:
Maximum Small IOPS=9091 @ Small=3 and Large=0
Minimum Small Latency=0.11 @ Small=1 and Large=0

add the dnode to zvol_state to avoid dnode_hold per zvol_write calling, apply the patch.
create the pool and volume again
turn 1st:
Maximum Small IOPS=12678 @ Small=1 and Large=0
Minimum Small Latency=0.08 @ Small=1 and Large=0

turn 2nd:
Maximum Small IOPS=9121 @ Small=4 and Large=0
Minimum Small Latency=0.11 @ Small=1 and Large=0

turn 3rd:
Maximum Small IOPS=8990 @ Small=1 and Large=0
Minimum Small Latency=0.11 @ Small=1 and Large=0

wangdbang · 2017-02-17T10:23:47Z

@ahrens and @behlendorf ,It seems that the performance has a little reduce, would you like to give some suggestion? thanks.

Add *_by_dnode() routines for accessing objects given their dnode_t *, this is more efficient than accessing the object by (objset_t *, uint64_t object). This change converts some but not all of the existing consumers. As performance-sensitive code paths are discovered they should be converted to use these routines. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com> Closes openzfs#5534 Issue openzfs#4802

wangdbang · 2017-03-21T06:11:44Z

@wli5, should i use the code that your give the link?

wli5 · 2017-03-21T07:21:20Z

@wangdbang I didn't contribute to that code.

behlendorf · 2017-03-21T11:57:55Z

@tuxoko and @ryao who've recently been looking at the zvol might be interested in this.

tuxoko · 2017-03-21T18:53:39Z

The zvol number looks close enough that it looks like within marginal of error. One thing to note is that we already use zv_dbuf to hold the bonus buffer for similar reason, so if we switch to zv_dn, we should get rid of zv_dbuf.

stale · 2020-11-11T03:38:39Z

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

behlendorf · 2020-11-11T18:21:42Z

This issue was resolved by #6058 and #10184.

behlendorf added the Type: Performance Performance improvement or performance problem label Jul 11, 2016

behlendorf added this to the 0.7.0 milestone Jul 11, 2016

behlendorf modified the milestones: 0.7.0, 0.8.0 Oct 11, 2016

behlendorf modified the milestones: 0.8.0, 0.9.0 Aug 3, 2018

behlendorf removed this from the 0.9.0 milestone Nov 11, 2019

stale bot added the Status: Stale No recent activity for issue label Nov 11, 2020

behlendorf closed this as completed Nov 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

use _by_dnode() routines for additional performance wins #4802

use _by_dnode() routines for additional performance wins #4802

ahrens commented Jun 26, 2016 •

edited

Loading

wangdbang commented Feb 15, 2017

behlendorf commented Feb 15, 2017

wangdbang commented Feb 16, 2017

behlendorf commented Feb 16, 2017 •

edited

Loading

ahrens commented Feb 16, 2017

wangdbang commented Feb 16, 2017

wangdbang commented Feb 16, 2017

wangdbang commented Feb 16, 2017

behlendorf commented Feb 16, 2017

wangdbang commented Feb 17, 2017

wangdbang commented Feb 17, 2017 •

edited by ahrens

Loading

wangdbang commented Feb 17, 2017

wangdbang commented Feb 17, 2017

wangdbang commented Mar 21, 2017

wli5 commented Mar 21, 2017

behlendorf commented Mar 21, 2017

tuxoko commented Mar 21, 2017

stale bot commented Nov 11, 2020

behlendorf commented Nov 11, 2020

use _by_dnode() routines for additional performance wins #4802

use _by_dnode() routines for additional performance wins #4802

Comments

ahrens commented Jun 26, 2016 • edited Loading

wangdbang commented Feb 15, 2017

behlendorf commented Feb 15, 2017

wangdbang commented Feb 16, 2017

behlendorf commented Feb 16, 2017 • edited Loading

ahrens commented Feb 16, 2017

wangdbang commented Feb 16, 2017

wangdbang commented Feb 16, 2017

wangdbang commented Feb 16, 2017

behlendorf commented Feb 16, 2017

wangdbang commented Feb 17, 2017

wangdbang commented Feb 17, 2017 • edited by ahrens Loading

wangdbang commented Feb 17, 2017

wangdbang commented Feb 17, 2017

wangdbang commented Mar 21, 2017

wli5 commented Mar 21, 2017

behlendorf commented Mar 21, 2017

tuxoko commented Mar 21, 2017

stale bot commented Nov 11, 2020

behlendorf commented Nov 11, 2020

ahrens commented Jun 26, 2016 •

edited

Loading

behlendorf commented Feb 16, 2017 •

edited

Loading

wangdbang commented Feb 17, 2017 •

edited by ahrens

Loading