Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use _by_dnode() routines for additional performance wins #4802

Closed
ahrens opened this issue Jun 26, 2016 · 19 comments
Closed

use _by_dnode() routines for additional performance wins #4802

ahrens opened this issue Jun 26, 2016 · 19 comments
Labels
Status: Design Review Needed Architecture or design is under discussion Status: Inactive Not being actively updated Status: Stale No recent activity for issue Type: Performance Performance improvement or performance problem

Comments

@ahrens
Copy link
Member

ahrens commented Jun 26, 2016

#4641 recommends adding *_by_dnode() routines for accessing objects given their dnode_t*, which is more efficient than accessing the object by (objset_t*, uint64_t object). We should convert additional performance-sensitive code paths to use these routines as we discover them.

For the concurrent file creation microbenchmark, the following additional code paths spend a lot of time in dnode_hold(), and could benefit from conversion:

dmu_tx_hold_object_impl() via dmu_tx_hold_sa()
dmu_tx_hold_object_impl() via dmu_tx_hold_sa_create()
dmu_tx_hold_zap(dzp->z_id) via zfs_create()
zap_add(dzp->z_id) via zfs_link_create()
dmu_tx_add_new_object() via dmu_object_alloc()
zap_lookup[_norm](dzp->z_id) via zfs_match_find()

It may be inconvenient to keep track of the dnode_t to supply in some of these code paths (or others). In that case we should consider implementing a "dnode cache": a global hashtable which maps from <objset_t*, objectID> to <dnode_t*>. This would avoid contention on the dn_mtx for dnodes that have been recently accessed. In terms of cost/reward, the cost of fixing this is medium, and the benefit is medium.

Analysis sponsored by Intel Corp.

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Jul 11, 2016
@behlendorf behlendorf added this to the 0.7.0 milestone Jul 11, 2016
@behlendorf behlendorf modified the milestones: 0.7.0, 0.8.0 Oct 11, 2016
behlendorf pushed a commit that referenced this issue Jan 13, 2017
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by 
(objset_t *, uint64_t object).  This change converts some but
not all of the existing consumers.  As performance-sensitive
code paths are discovered they should be converted to use
these routines.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Closes #5534 
Issue #4802
@wangdbang
Copy link

@ahrens and @behlendorf , Could we use "dmu_tx_hold_write_by_dnode" to replace "dmu_tx_hold_write" in zvol_write?

@behlendorf
Copy link
Contributor

@wangdbang we can. We can also make the same change to zvol_replay_write.

@wangdbang
Copy link

@behlendorf ,thank you for reply, another question, Could the tx(dmu_txt_t) be a member of the zvol? avoid to call the dmu_tx_create per zvol_write. the same way, if we use "dmu_tx_hold_write_by_dnode" to replace "dmu_tx_hold_write", Could the dnode be a member of the zvol? is that possible?

@behlendorf
Copy link
Contributor

behlendorf commented Feb 16, 2017

@wangdbang we should be able to safely add the dnode_t for the DMU_OST_ZVOL object to zvol_state_t as long and we keep a hold on it. A similar thing is already done for the bonus dbuf, zv->zv->dbuf.

The tx is another story. We need to create a new one for every write for the ZIL to function as designed.

By chance have you done any profiling of how expensive the dnode lookup is here?

@ahrens
Copy link
Member Author

ahrens commented Feb 16, 2017

dmu_tx_hold_write_by_dnode() from zvol should be possible.

dmu_tx_create/assign/commit() for each write is fundamental (not just for the ZIL).

@wangdbang
Copy link

@behlendorf ,thanks, I did not evaluate the cost for dnode lookup, just a suggestion to add the dnode to zvol_state_t as a member from code view sides to avoid dnode lookup per zol_write. i'm a new user of ZOL, and study the code as a beginner. thanks again.

@wangdbang
Copy link

@ahrens and @behlendorf , if you have a patch, i would like to verify it.

@wangdbang
Copy link

@ahrens and @behlendorf , just evaluate zvol_write performance, My rough thought like this, add a dnode point variable zv_dn into zvol_state, assign the value through dnode_hold at first call in zvol_write, and then call dmu_tx_hold_write_by_dnode to replace the dmu_tx_hold_write, is that possible? if it is, i'll try it.

@behlendorf
Copy link
Contributor

@wangdbang adding zv_dn into zvol_state_t makes sense but you'll want to take the hold it in zvol_setup_zv() and release it in zvol_shutdown_zv(). You'll then be able to use zv->zv_dn in zvol_write() for the dmu_tx_hold_write_by_dnode() call.

When benchmarking the writes I'd suggest using a small block size since that's the workload we'd expect to benefit most from this optimization. Thanks for looking at this!

@wangdbang
Copy link

@behlendorf ,thanks for your suggestion, i'll try it, and share the result if have any progress.

@wangdbang
Copy link

wangdbang commented Feb 17, 2017

@behlendorf and @ahrens , here is the rough patch base on v0.7.0-rc3

diff -urN zfs-0.7.0.orig/include/sys/dmu.h zfs-0.7.0/include/sys/dmu.h
--- zfs-0.7.0.orig/include/sys/dmu.h    2017-01-21 02:31:52.000000000 +0800
+++ zfs-0.7.0/include/sys/dmu.h 2017-02-17 16:50:09.007384464 +0800
@@ -674,6 +674,8 @@

 dmu_tx_t *dmu_tx_create(objset_t *os);
 void dmu_tx_hold_write(dmu_tx_t *tx, uint64_t object, uint64_t off, int len);
+int dmu_tx_dnode_hold(objset_t *os, uint64_t object, dnode_t **dnp);
+void dmu_tx_dnode_rele(dnode_t *dn);
 void dmu_tx_hold_write_by_dnode(dmu_tx_t *tx, dnode_t *dn, uint64_t off,
     int len);
 void dmu_tx_hold_free(dmu_tx_t *tx, uint64_t object, uint64_t off,
diff -urN zfs-0.7.0.orig/module/zfs/dmu_tx.c zfs-0.7.0/module/zfs/dmu_tx.c
--- zfs-0.7.0.orig/module/zfs/dmu_tx.c  2017-01-21 02:36:28.000000000 +0800
+++ zfs-0.7.0/module/zfs/dmu_tx.c       2017-02-17 16:49:52.934384295 +0800
@@ -453,6 +453,18 @@
        dmu_tx_count_dnode(txh);
 }

+int
+dmu_tx_dnode_hold(objset_t *os, uint64_t object, dnode_t **dnp)
+{
+        return (dnode_hold(os, object, FTAG, dnp));
+}
+
+void
+dmu_tx_dnode_rele(dnode_t *dn)
+{
+        dnode_rele(dn, FTAG);
+}
+
 void
 dmu_tx_hold_write_by_dnode(dmu_tx_t *tx, dnode_t *dn, uint64_t off, int len)
 {
diff -urN zfs-0.7.0.orig/module/zfs/zvol.c zfs-0.7.0/module/zfs/zvol.c
--- zfs-0.7.0.orig/module/zfs/zvol.c    2017-01-21 02:36:28.000000000 +0800
+++ zfs-0.7.0/module/zfs/zvol.c 2017-02-17 18:00:07.759383322 +0800
@@ -89,6 +89,7 @@
        struct hlist_node       zv_hlink;       /* hash link */
        atomic_t                zv_suspend_ref; /* refcount for suspend */
        krwlock_t               zv_suspend_lock;        /* suspend lock */
+       dnode_t                 *zv_dn;
 };

 typedef enum {
@@ -655,7 +656,7 @@
                if (bytes > volsize - off)      /* don't write past the end */
                        bytes = volsize - off;

-               dmu_tx_hold_write(tx, ZVOL_OBJ, off, bytes);
+               dmu_tx_hold_write_by_dnode(tx, zv->zv_dn, off, bytes);

                /* This will only fail for ENOSPC */
                error = dmu_tx_assign(tx, TXG_WAIT);
@@ -980,6 +981,10 @@
        if (error)
                return (SET_ERROR(error));

+        error = dmu_tx_dnode_hold(os, ZVOL_OBJ, &zv->zv_dn);
+        if (error)
+                return (SET_ERROR(error));
+
        set_capacity(zv->zv_disk, volsize >> 9);
        zv->zv_volsize = volsize;
        zv->zv_zilog = zil_open(os, zvol_get_data);
@@ -1008,6 +1013,8 @@
        dmu_buf_rele(zv->zv_dbuf, zv);
        zv->zv_dbuf = NULL;

+       dmu_tx_dnode_rele(zv->zv_dn);
+
        /*
         * Evict cached data
         */

@wangdbang
Copy link

Here is the 8k test result with orion.

  1. envrionment setup: 8 X INTEL SSDSC2BB80
    RAIDZ1 command:zpool create -f POOL00 raidz sdb sdc sdd sde sdf sdg sdh sdi -o ashift=12
    ZVOL command:zfs create -b 8k -V 200GB POOL00/VOL00

  2. test command: orion -run advanced -testname sd -num_disks 32 -size_small 8 -type rand -cache_size 0 -matrix row -num_large 0 -write 100

  3. test config: add "/dev/zd0" into sd.lun

  4. v0.7.0-rc3 results:
    turn 1st:
    Maximum Small IOPS=12709 @ Small=1 and Large=0
    Minimum Small Latency=0.08 @ Small=1 and Large=0

turn 2nd:
Maximum Small IOPS=9355 @ Small=3 and Large=0
Minimum Small Latency=0.11 @ Small=1 and Large=0

turn 3rd:
Maximum Small IOPS=9091 @ Small=3 and Large=0
Minimum Small Latency=0.11 @ Small=1 and Large=0

add the dnode to zvol_state to avoid dnode_hold per zvol_write calling, apply the patch.
create the pool and volume again
turn 1st:
Maximum Small IOPS=12678 @ Small=1 and Large=0
Minimum Small Latency=0.08 @ Small=1 and Large=0

turn 2nd:
Maximum Small IOPS=9121 @ Small=4 and Large=0
Minimum Small Latency=0.11 @ Small=1 and Large=0

turn 3rd:
Maximum Small IOPS=8990 @ Small=1 and Large=0
Minimum Small Latency=0.11 @ Small=1 and Large=0

@wangdbang
Copy link

@ahrens and @behlendorf ,It seems that the performance has a little reduce, would you like to give some suggestion? thanks.

wli5 pushed a commit to wli5/zfs that referenced this issue Feb 28, 2017
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by 
(objset_t *, uint64_t object).  This change converts some but
not all of the existing consumers.  As performance-sensitive
code paths are discovered they should be converted to use
these routines.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Closes openzfs#5534 
Issue openzfs#4802
wli5 pushed a commit to wli5/zfs that referenced this issue Feb 28, 2017
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by
(objset_t *, uint64_t object).  This change converts some but
not all of the existing consumers.  As performance-sensitive
code paths are discovered they should be converted to use
these routines.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Closes openzfs#5534
Issue openzfs#4802
@wangdbang
Copy link

@wli5, should i use the code that your give the link?

@wli5
Copy link
Contributor

wli5 commented Mar 21, 2017

@wangdbang I didn't contribute to that code.

@behlendorf
Copy link
Contributor

@tuxoko and @ryao who've recently been looking at the zvol might be interested in this.

@tuxoko
Copy link
Contributor

tuxoko commented Mar 21, 2017

The zvol number looks close enough that it looks like within marginal of error. One thing to note is that we already use zv_dbuf to hold the bonus buffer for similar reason, so if we switch to zv_dn, we should get rid of zv_dbuf.

@behlendorf behlendorf modified the milestones: 0.8.0, 0.9.0 Aug 3, 2018
@behlendorf behlendorf removed this from the 0.9.0 milestone Nov 11, 2019
@stale
Copy link

stale bot commented Nov 11, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Nov 11, 2020
@behlendorf
Copy link
Contributor

This issue was resolved by #6058 and #10184.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Design Review Needed Architecture or design is under discussion Status: Inactive Not being actively updated Status: Stale No recent activity for issue Type: Performance Performance improvement or performance problem
Projects
None yet
Development

No branches or pull requests

6 participants
@tuxoko @behlendorf @ahrens @wangdbang @wli5 and others