Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dmu_tx_hold_zap() does dnode_hold() 7x on same object #4641

Closed
ahrens opened this issue May 12, 2016 · 1 comment
Closed

dmu_tx_hold_zap() does dnode_hold() 7x on same object #4641

ahrens opened this issue May 12, 2016 · 1 comment
Labels
Type: Performance Performance improvement or performance problem

Comments

@ahrens
Copy link
Member

ahrens commented May 12, 2016

Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, and with a workaround for #4636, I observed poor performance (~30,000 file creations per second). I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held):

image

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created:

  • dmu_buf_hold_noread()
  • dmu_buf_hold()
  • zap_lookup()
  • zap_lookup_norm()
  • zap_count_write()
  • zap_lockdir()
  • zap_count_write()

A prototype implementation has shown that this can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around #4636. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds.

Once #4636 is fixed, the reward for fixing this issue is high, and the cost is low. Although the code changes are spread among many functions, they are quite straightforward to make and to understand. There is no risk of hurting performance in other use cases. We may find that this “… by dnode” technique can be applied to other use cases as well.

@ahrens
Copy link
Member Author

ahrens commented May 27, 2016

OpenZFS-illumos pull request: openzfs/openzfs#109

nedbass pushed a commit to nedbass/zfs that referenced this issue May 28, 2016
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Upstream bugs: DLPX-44797

Ported by: Ned Bass <bass6@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7004
ZFSonLinux-issue: openzfs#4641
OpenZFS-commit: unmerged

Porting notes:
- Changed ASSERT0(err) to VERIFY0(err) in zap_lockdir() to avoid unused
  variable error. Code may be refactored in future upstream revision to
  clean this up.
- Changed EXPORT_SYMBOL(zap_count_write) to
  EXPORT_SYMBOL(zap_count_write_by_dnode) in zap_micro.c
nedbass pushed a commit to nedbass/zfs that referenced this issue May 28, 2016
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Upstream bugs: DLPX-44797

Ported by: Ned Bass <bass6@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7004
ZFSonLinux-issue: openzfs#4641
OpenZFS-commit: unmerged

Porting notes:
- Changed ASSERT0(err) to VERIFY0(err) in zap_lockdir() to avoid unused
  variable error. Code may be refactored in future upstream revision to
  clean this up.
- Changed EXPORT_SYMBOL(zap_count_write) to
  EXPORT_SYMBOL(zap_count_write_by_dnode) in zap_micro.c
nedbass pushed a commit to nedbass/zfs that referenced this issue Jun 24, 2016
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Upstream bugs: DLPX-44797

Ported by: Ned Bass <bass6@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7004
ZFSonLinux-issue: openzfs#4641
OpenZFS-commit: unmerged

Porting notes:
- Changed ASSERT0(err) to VERIFY0(err) in zap_lockdir() to avoid unused
  variable error. Code may be refactored in future upstream revision to
  clean this up.
- Changed EXPORT_SYMBOL(zap_count_write) to
  EXPORT_SYMBOL(zap_count_write_by_dnode) in zap_micro.c
behlendorf pushed a commit to LLNL/zfs that referenced this issue Jun 27, 2016
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Upstream bugs: DLPX-44797

Ported by: Ned Bass <bass6@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7004
ZFSonLinux-issue: openzfs#4641
OpenZFS-commit: unmerged

Porting notes:
- Changed ASSERT0(err) to VERIFY0(err) in zap_lockdir() to avoid unused
  variable error. Code may be refactored in future upstream revision to
  clean this up.
- Changed EXPORT_SYMBOL(zap_count_write) to
  EXPORT_SYMBOL(zap_count_write_by_dnode) in zap_micro.c
tuomari pushed a commit to tuomari/zfs that referenced this issue Jun 30, 2016
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Upstream bugs: DLPX-44797

Ported by: Ned Bass <bass6@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7004
ZFSonLinux-issue: openzfs#4641
OpenZFS-commit: unmerged

Porting notes:
- Changed ASSERT0(err) to VERIFY0(err) in zap_lockdir() to avoid unused
  variable error. Code may be refactored in future upstream revision to
  clean this up.
- Changed EXPORT_SYMBOL(zap_count_write) to
  EXPORT_SYMBOL(zap_count_write_by_dnode) in zap_micro.c
@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Jul 15, 2016
tuomari pushed a commit to tuomari/zfs that referenced this issue Jul 17, 2016
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Upstream bugs: DLPX-44797

Ported by: Ned Bass <bass6@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7004
ZFSonLinux-issue: openzfs#4641
OpenZFS-commit: unmerged

Porting notes:
- Changed ASSERT0(err) to VERIFY0(err) in zap_lockdir() to avoid unused
  variable error. Code may be refactored in future upstream revision to
  clean this up.
- Changed EXPORT_SYMBOL(zap_count_write) to
  EXPORT_SYMBOL(zap_count_write_by_dnode) in zap_micro.c
ahrens added a commit to ahrens/zfs that referenced this issue Aug 15, 2016
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Closes openzfs#4641
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Performance Performance improvement or performance problem
Projects
None yet
Development

No branches or pull requests

2 participants