arc_reclaim / arc_prune out of control after running long rsync #4726

Closed
jonathanvaughn opened this issue Jun 4, 2016 · 29 comments

@jonathanvaughn

This may be related to similar issues such as #4345 or #4239, but those didn't seem to quite match what I'm experiencing.

We've upgraded several machines recently from CentOS 6 to 7, and are using ZFS. I haven't seen this issue on all machines, only the most recent one, so I'm not sure what the issue is.

When performing a large rsync (many GB and hundreds of thousands of files, but not performing a checksum comparison, so it's mostly stat performance, not reads), the ARC fills up and the machine becomes nearly unresponsive due to arc_reclaim / arc_prune.

The dataset is set to primarycache=metadata, so I wouldn't have expected it to fill faster than it could be dealt with. I had the ARC max set to 8 GB (32 GB machine), and in order to get the machine responsive again I had to increase it to 12 GB. Even after doing so, and even with the machine idle, it's been well over half an hour and arc_reclaim / arc_prune are still running, even though there's more than 3 GB of free ARC at this point (it grew past 8 GB). I tried limiting the metadata size to see if that would help, but it didn't seem to matter.

Running kernel 4.6.0-1.el7.elrepo.x86_64 on CentOS 7, ZFS version v0.6.5.7-1 (the same as the other machines, which have had no issues).

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
19:49:56    82    53     64    51   63     2  100    53   64   8.7G   12G
ZFS Subsystem Report                            Fri Jun 03 19:49:22 2016
ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                570.61k
        Mutex Misses:                           5.58k
        Evict Skips:                            5.58k

ARC Size:                               70.46%  8.45    GiB
        Target Size: (Adaptive)         100.00% 12.00   GiB
        Min Size (Hard Limit):          4.17%   512.00  MiB
        Max Size (High Water):          24:1    12.00   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       50.21%  6.03    GiB
        Frequently Used Cache Size:     49.79%  5.97    GiB

ARC Hash Breakdown:
        Elements Max:                           412.63k
        Elements Current:               34.67%  143.05k
        Collisions:                             93.65k
        Chain Max:                              3
        Chains:                                 1.99k

ARC Total accesses:                                     31.60m
        Cache Hit Ratio:                82.35%  26.02m
        Cache Miss Ratio:               17.65%  5.58m
        Actual Hit Ratio:               49.50%  15.64m

        Data Demand Efficiency:         75.67%  1.47m
        Data Prefetch Efficiency:       63.21%  3.16k

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             39.65%  10.32m
          Most Recently Used:           14.17%  3.69m
          Most Frequently Used:         45.95%  11.96m
          Most Recently Used Ghost:     0.24%   62.46k
          Most Frequently Used Ghost:   0.00%   115

        CACHE HITS BY DATA TYPE:
          Demand Data:                  4.26%   1.11m
          Prefetch Data:                0.01%   2.00k
          Demand Metadata:              55.85%  14.53m
          Prefetch Metadata:            39.88%  10.38m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  6.39%   356.75k
          Prefetch Data:                0.02%   1.16k
          Demand Metadata:              91.17%  5.09m
          Prefetch Metadata:            2.42%   134.89k


File-Level Prefetch: (HEALTHY)
DMU Efficiency:                                 45.63m
        Hit Ratio:                      92.22%  42.08m
        Miss Ratio:                     7.78%   3.55m

        Colinear:                               3.55m
          Hit Ratio:                    0.04%   1.39k
          Miss Ratio:                   99.96%  3.55m

        Stride:                                 41.74m
          Hit Ratio:                    99.99%  41.74m
          Miss Ratio:                   0.01%   5.31k

DMU Misc:
        Reclaim:                                3.55m
          Successes:                    0.35%   12.56k
          Failures:                     99.65%  3.54m

        Streams:                                345.91k
          +Resets:                      0.01%   36
          -Resets:                      99.99%  345.87k
          Bogus:                                0


ZFS Tunable:
        metaslab_debug_load                               0
        zfs_arc_min_prefetch_lifespan                     0
        zfetch_max_streams                                8
        zfs_nopwrite_enabled                              1
        zfetch_min_sec_reap                               2
        zfs_dbgmsg_enable                                 0
        zfs_dirty_data_max_max_percent                    25
        zfs_arc_p_aggressive_disable                      1
        spa_load_verify_data                              1
        zfs_zevent_cols                                   80
        zfs_dirty_data_max_percent                        10
        zfs_sync_pass_dont_compress                       5
        l2arc_write_max                                   8388608
        zfs_vdev_scrub_max_active                         2
        zfs_vdev_sync_write_min_active                    10
        zvol_prefetch_bytes                               131072
        metaslab_aliquot                                  524288
        zfs_no_scrub_prefetch                             0
        zfs_arc_shrink_shift                              0
        zfetch_block_cap                                  256
        zfs_txg_history                                   0
        zfs_delay_scale                                   500000
        zfs_vdev_async_write_active_min_dirty_percent     30
        metaslab_debug_unload                             0
        zfs_read_history                                  0
        zvol_max_discard_blocks                           16384
        zfs_recover                                       0
        l2arc_headroom                                    2
        zfs_deadman_synctime_ms                           1000000
        zfs_scan_idle                                     50
        zfs_free_min_time_ms                              1000
        zfs_dirty_data_max                                3358756864
        zfs_vdev_async_read_min_active                    1
        zfs_mg_noalloc_threshold                          0
        zfs_dedup_prefetch                                0
        zfs_vdev_max_active                               1000
        l2arc_write_boost                                 8388608
        zfs_resilver_min_time_ms                          3000
        zfs_vdev_async_write_max_active                   10
        zil_slog_limit                                    1048576
        zfs_prefetch_disable                              0
        zfs_resilver_delay                                2
        metaslab_lba_weighting_enabled                    1
        zfs_mg_fragmentation_threshold                    85
        l2arc_feed_again                                  1
        zfs_zevent_console                                0
        zfs_immediate_write_sz                            32768
        zfs_dbgmsg_maxsize                                4194304
        zfs_free_leak_on_eio                              0
        zfs_deadman_enabled                               1
        metaslab_bias_enabled                             1
        zfs_arc_p_dampener_disable                        1
        zfs_object_mutex_size                             64
        zfs_metaslab_fragmentation_threshold              70
        zfs_no_scrub_io                                   0
        metaslabs_per_vdev                                200
        zfs_dbuf_state_index                              0
        zfs_vdev_sync_read_min_active                     10
        metaslab_fragmentation_factor_enabled             1
        zvol_inhibit_dev                                  0
        zfs_vdev_async_write_active_max_dirty_percent     60
        zfs_vdev_cache_size                               0
        zfs_vdev_mirror_switch_us                         10000
        zfs_dirty_data_sync                               67108864
        spa_config_path                                   /etc/zfs/zpool.cache
        zfs_dirty_data_max_max                            8396892160
        zfs_arc_lotsfree_percent                          10
        zfs_zevent_len_max                                128
        zfs_scan_min_time_ms                              1000
        zfs_arc_sys_free                                  0
        zfs_arc_meta_strategy                             1
        zfs_vdev_cache_bshift                             16
        zfs_arc_meta_adjust_restarts                      4096
        zfs_max_recordsize                                1048576
        zfs_vdev_scrub_min_active                         1
        zfs_vdev_read_gap_limit                           32768
        zfs_arc_meta_limit                                8589934592
        zfs_vdev_sync_write_max_active                    10
        l2arc_norw                                        0
        zfs_arc_meta_prune                                10000
        metaslab_preload_enabled                          1
        l2arc_nocompress                                  0
        zvol_major                                        230
        zfs_vdev_aggregation_limit                        131072
        zfs_flags                                         0
        spa_asize_inflation                               24
        zfs_admin_snapshot                                0
        l2arc_feed_secs                                   1
        zio_taskq_batch_pct                               75
        zfs_sync_pass_deferred_free                       2
        zfs_disable_dup_eviction                          0
        zfs_arc_grow_retry                                0
        zfs_read_history_hits                             0
        zfs_vdev_async_write_min_active                   1
        zfs_vdev_async_read_max_active                    3
        zfs_scrub_delay                                   4
        zfs_delay_min_dirty_percent                       60
        zfs_free_max_blocks                               100000
        zfs_vdev_cache_max                                16384
        zio_delay_max                                     30000
        zfs_top_maxinflight                               32
        spa_slop_shift                                    5
        zfs_vdev_write_gap_limit                          4096
        spa_load_verify_metadata                          1
        spa_load_verify_maxinflight                       10000
        l2arc_noprefetch                                  1
        zfs_vdev_scheduler                                noop
        zfs_expire_snapshot                               300
        zfs_sync_pass_rewrite                             2
        zil_replay_disable                                0
        zfs_nocacheflush                                  0
        zfs_arc_max                                       12884901888
        zfs_arc_min                                       536870912
        zfs_read_chunk_size                               1048576
        zfs_txg_timeout                                   5
        zfs_pd_bytes_max                                  52428800
        l2arc_headroom_boost                              200
        zfs_send_corrupt_data                             0
        l2arc_feed_min_ms                                 200
        zfs_arc_meta_min                                  0
        zfs_arc_average_blocksize                         8192
        zfetch_array_rd_sz                                1048576
        zfs_autoimport_disable                            1
        zfs_arc_p_min_shift                               0
        zio_requeue_io_start_cut_in_line                  1
        zfs_vdev_sync_read_max_active                     10
        zfs_mdcomp_disable                                0
        zfs_arc_num_sublists_per_state                    8
NAME                                                                                                USED  AVAIL  REFER  MOUNTPOINT
REDACTED_pool0                                                                                        663G  1.11T    96K  /REDACTED_pool0
REDACTED_pool0/ROOT                                                                                  2.18G  1.11T  2.18G  /REDACTED_pool0/ROOT
REDACTED_pool0/atlassian-data                                                                        68.0G  1.11T  68.0G  /var/atlassian
REDACTED_pool0/atlassian-opt                                                                         3.06G  1.11T  3.06G  /opt/atlassian
REDACTED_pool0/docker                                                                                3.93M  1.11T  3.93M  /var/docker
REDACTED_pool0/docker-storage                                                                         932M  1.11T  3.93M  /var/lib/docker
REDACTED_pool0/docker-storage/0056857a64482466f0237a0151573bf9d25978be72309cffd2469ff2a7c563db        356K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/00ac923f6cd89b24efa1b6e6c979f6742e1d7254d55d36e9a4714b2227cfe425        756K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/012e103c0fcc67724a911f0c498a021a56f68f9e81d8dff7efcfbfae98514329        148K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/0480ce2f1b482c7de707a34af5b5939ef5766d36e418eb0851255529186b9250        140K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/0baabef050bdca13b0d07a49fd2970e4fe539d7508f4baf1620322e2e52db0cb        132K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/106eb69bddb28b4cd151964833d657f35172b48a0806f8a08c9e812768295dc1         96K  1.11T   337M  legacy
REDACTED_pool0/docker-storage/184df186e387e4bc17ac9bc1533480b57410215edd76f59c99515baff97137c7        636K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/1a606111d2307b93d61263375b1d497f2f1448d0782fe605bd5dba3bc7c9a462        232K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/1e3826605d1afdbd80412f04346cdbea14259483e77c785afd760830417ab1b3        140K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/1feff74a8ca94e76ec99a5760dde6bd4bd4b2d5422414f572a4af1d65bba5571       5.77M  1.11T   377M  legacy
REDACTED_pool0/docker-storage/27edf3335bc0054d7f445ffa3f5f6ab9c0b4489f930dbe7d01cb102c1c2ebecb        140K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/2897462c8ec5deb33b59a63611d4b3ec0b0026235410aac961fbfa4d0413ca11        140K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/32b6044561f0a6021fd0ef6a3fb80f882ccd8b6430f84c3499327477422099f7        112K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/35c32fbbb9a659ad206f549b6c110044d69c0f3ebe4a10cf3ffa85054d7229b8        112K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/35eb4d2ea04eefa3cff4a64ac68916eeab8f75c899b9faf6e8b0b690241140fb        148K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/36ed577a48877e1a8efb8af3b197090c8b8a6c0920608cd9c1867d8ea9260ff0        156K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/3dc93a1d00d332f23afac47250baf0494704d60c2aeeee62852e99b843657b51       7.47M  1.11T   125M  legacy
REDACTED_pool0/docker-storage/3e4c71cebc543f819b35527786c04ae3499c9c5fabc62080e84b4bb5a66e4b03        104K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/3efa2aa1e0582899d6faaef860e4781f828626ab8bbbe978342a76d97fc82f68        140K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/445e327c7b229541de7860c31a096764cd6a420fd609d05572b3a4d5bb1edd0e        800K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/445e327c7b229541de7860c31a096764cd6a420fd609d05572b3a4d5bb1edd0e-init   176K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/46448c001a786f0c58148511ec8ec6dbb264e5bff981b1ed326f94d2e86bcd7c        356K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/47d44cb6f252ea4f6aecf8a447972de5d9f9f2e2bec549a2f1d8f92557f4d05a        104K  1.11T    96K  legacy
REDACTED_pool0/docker-storage/49ad0c973296a46ef7fcc70a24fe4640dfba286076e9302b5cdf828abdd13501        132K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/4e9f14010a5270b899c038a6e9cc9d4b6206422bc625f2470df41cec44a9cce0        232K  1.11T   398M  legacy
REDACTED_pool0/docker-storage/4f5c015cca28acd98b93a99c2d0695a0f6068a0b972a6d999901f715d44f4508       61.9M  1.11T   391M  legacy
REDACTED_pool0/docker-storage/511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158        104K  1.11T    96K  legacy
REDACTED_pool0/docker-storage/5156bebbccf9cc966f8fd16cf277340c609fe8970f68273dc52d4894e50498b1        185M  1.11T   362M  legacy
REDACTED_pool0/docker-storage/5764f0a3131791360948d70cc2714226a1ec786675d27e09348abd4adecb03ea        104K  1.11T   125M  legacy
REDACTED_pool0/docker-storage/594cd51330f70859dd1dd919f3f2d3ab89f0d5f55153eb829bb084524054eadb       5.75M  1.11T   399M  legacy
REDACTED_pool0/docker-storage/5ae46cd541a55793c3963478066abf7b9c2be42ab6864c8c1618e5021306c3b4        148K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/5b12ef8fd57065237a6833039acc0e7f68e363c15d8abb5cacce7143a1f7de8a         72K  1.11T    96K  legacy
REDACTED_pool0/docker-storage/5ba63a7ec5d003675f6c0f692807be7d131c395f35c89386285d807e1caedeca       4.49M  1.11T   393M  legacy
REDACTED_pool0/docker-storage/5d3dc7393d4ae5a85678ace6b8e1bfa7404c56199d94b7270c5978857d6bb8f6        156K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/60e65a8e4030022260a4f84166814b2683e1cdfc9725a9c262e90ba9c5ae2332        104K  1.11T   125M  legacy
REDACTED_pool0/docker-storage/6856d39a282fe617098075475f2a857a2a297645b91cfe4dc3bbf0fb25cca214        984K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/6856d39a282fe617098075475f2a857a2a297645b91cfe4dc3bbf0fb25cca214-init   176K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/6ae1c96016a58e20db8987812a10d452171b9376c38c043f17f420d498a23b17        112K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/6cec9a305daa7981fde96f6633443a3c78e4d8385fa08e7ab0e18d3377784d35        148K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/7080051af98969ef315b1579f289aa1a1dd99834922ae022a726d6468a8db880       2.36M  1.11T   372M  legacy
REDACTED_pool0/docker-storage/71098c77ff51f5f2d22f09d2913f2a933f92618d55ff4cd1f3e4c750de7e3cad        788K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/79403fcabcbce6f2f1f33183c05caeba3395601c42310b6cca49aa4ba79dd58c       38.7M  1.11T   389M  legacy
REDACTED_pool0/docker-storage/838c1c5c4f833fda62e928de401303d293d23d52c831407b12edd95ca3f1839e        125M  1.11T   125M  legacy
REDACTED_pool0/docker-storage/84407cfa9ecdb7463512414e4a8fe1a6fba690f0b71b9b52a18b0a72d3410742        104K  1.11T   125M  legacy
REDACTED_pool0/docker-storage/8b83cd0a5724dc8d4c6c74c7a14b546f58f6aab8eecdf8bff711f3f83a4a4d03        140K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/8f53bed55de21acc781996cbbf664e3fae213e575d1d13a8e491b5178396b3f7        164M  1.11T   337M  legacy
REDACTED_pool0/docker-storage/9ca1d808cb59542f9e1510bbb47e7e138c0e29908c0dfd5e7bb68e7d2b959644       1.06M  1.11T   377M  legacy
REDACTED_pool0/docker-storage/9ca1d808cb59542f9e1510bbb47e7e138c0e29908c0dfd5e7bb68e7d2b959644-init   232K  1.11T   377M  legacy
REDACTED_pool0/docker-storage/a6c4f4c57f963c985747361f7b96948161ee588feb680de40ad966fe8c2e65d4        104K  1.11T   377M  legacy
REDACTED_pool0/docker-storage/b1b7f0f37901c473c30ded439bdd84d39cd5750721873fa821d94493dfa695d4       71.6M  1.11T   187M  legacy
REDACTED_pool0/docker-storage/b3a2c44bb00bee64a0d0c42e8643646eb41df8cb38960b69ed318bc0533a72e4        112K  1.11T   133M  legacy
REDACTED_pool0/docker-storage/b448af7556ed98e73b8439965f7b53d527f1f178281e25ed6b4fe9e9d2e83564       1.86M  1.11T   393M  legacy
REDACTED_pool0/docker-storage/b5ced8b2946c34da022c8d7fe1fc57528ce6b15b3ad2fbed01c79f5e69d00d76        104K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/ca5966735c535a0a9150496df6badd5da1187384b53b3debf28d0b00b80bceb5       5.80M  1.11T   398M  legacy
REDACTED_pool0/docker-storage/cbdb5fea83decdf0e9a3cd3e9495550fe5022abcd4b8c6b5620dddfe831df4de       4.71M  1.11T   133M  legacy
REDACTED_pool0/docker-storage/cc0cb24fc14d3ce6e75d6c4818632add7972bea82371f25a2fbefc6190ca2e16        148K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/ced21ad71cbe6b06e41d95192c4b27b330f37bdba0a5ae32df571115601d1a88        240K  1.11T   377M  legacy
REDACTED_pool0/docker-storage/d3bae8310b00855a43751fa7ed47ad35581cb32e5f6e61d3835bd16be86d8211        416K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/d9ecc821c7ccba515c5040a33f0b6e8a9eb419775c9474687659f350022961df        112K  1.11T   362M  legacy
REDACTED_pool0/docker-storage/e005ab5515593115e9e4953fc3fbc5827f9182b28774b3055148eca822daf479       42.4M  1.11T   370M  legacy
REDACTED_pool0/docker-storage/e355356651cc59ae780a08a19ae45180d9ab62c1618652e7671dfbc27fe5148f        132K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/ed454ebfb7d17bd061730e68cd42028b3d26b9e341e5bd668b75b506e5ccb012        112K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/f6808a3e4d9e80a655ec625e38b869ed8a614611e4d0073aeff23be841c9fcff        133M  1.11T   133M  legacy
REDACTED_pool0/docker-storage/f75d9f29b700f0dfa47012b53061affbdf47b2f111ea15d336bbcc0eba3c370a        112K  1.11T   362M  legacy
REDACTED_pool0/docker-storage/fb72f570f4b4da69d5903190750b1f6225b1547a705a4ab5bef8bcec9507a5c1       56.9M  1.11T   183M  legacy
REDACTED_pool0/share_REDACTED                                                                       3.56G  1.11T  3.56G  /usr/share/REDACTED
REDACTED_pool0/swap                                                                                  34.0G  1.14T    64K  -
REDACTED_pool0/www                                                                                    551G  1.11T  1.32M  /var/www
REDACTED_pool0/www/mediarepo-local                                                                   26.4G  1.11T  26.4G  /var/www/mediarepo-local
REDACTED_pool0/www/preview.REDACTED.com                                                              281G  1.11T   281G  /var/www/preview.REDACTED.com
REDACTED_pool0/www/qa.REDACTED.net                                                                   243G  1.11T   243G  /var/www/qa.REDACTED.net
@dweeezil
Contributor

dweeezil commented Jun 4, 2016

@jonathanvaughn Could you please post the full /proc/spl/kstat/zfs/arcstats when this is happening? I suspect there are other processes competing for memory. By default, on a 32 GiB system, ZoL is going to want to keep at least 512 MiB of free memory in the system. This behavior can be adjusted with the newish zfs_arc_sys_free tunable if necessary.
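(For reference, a minimal sketch of checking the relevant counters and raising the free-memory floor at runtime, assuming the standard ZoL module-parameter interface under /sys/module/zfs/parameters; the 1 GiB value is purely illustrative, not a recommendation:)

# grep -E '^(size|c_max|arc_meta_used|arc_meta_limit|arc_sys_free) ' /proc/spl/kstat/zfs/arcstats   # current ARC and metadata usage vs. limits
# echo $((1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_sys_free                            # keep ~1 GiB free for the rest of the system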

@jonathanvaughn
Author

jonathanvaughn commented Jun 4, 2016

I've tried turning primarycache=all on for most of the datasets in case it was some weird behavior with only metadata being cacheable, but it still happens.
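(For reference, flipping that is just a dataset property change; the dataset name below is taken from the listing above and is only an example:)

# zfs set primarycache=all REDACTED_pool0/www
# zfs get primarycache REDACTED_pool0/www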

Here's the output from free -m:

              total        used        free      shared  buff/cache   available
Mem:          32031         884       13880          51       17266       14399
Swap:         32767           0       32767

As you can see, there's plenty of free memory at the moment.

Here's the output you wanted:

6 1 0x01 91 4368 1579813499 38581696671350
name                            type data
hits                            4    21401257
misses                          4    3077974
demand_data_hits                4    119714
demand_data_misses              4    52075
demand_metadata_hits            4    12301614
demand_metadata_misses          4    2877192
prefetch_data_hits              4    7
prefetch_data_misses            4    409
prefetch_metadata_hits          4    8979922
prefetch_metadata_misses        4    148298
mru_hits                        4    4113558
mru_ghost_hits                  4    6889
mfu_hits                        4    8307770
mfu_ghost_hits                  4    126
deleted                         4    256144
mutex_miss                      4    796
evict_skip                      4    59340458
evict_not_enough                4    14941394
evict_l2_cached                 4    0
evict_l2_eligible               4    1496560128
evict_l2_ineligible             4    289325056
evict_l2_skip                   4    0
hash_elements                   4    135033
hash_elements_max               4    364571
hash_collisions                 4    18243
hash_chains                     4    1395
hash_chain_max                  4    3
p                               4    4401385984
c                               4    8589934592
c_min                           4    536870912
c_max                           4    8589934592
size                            4    8629517472
hdr_size                        4    57255552
data_size                       4    0
metadata_size                   4    2192923136
other_size                      4    6379338784
anon_size                       4    32768
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    2159458816
mru_evictable_data              4    0
mru_evictable_metadata          4    81920
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    33431552
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    1261
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    684374023
arc_meta_used                   4    8629517472
arc_meta_limit                  4    8589934592
arc_meta_max                    4    8629603384
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    524804096

Top:

top - 16:22:40 up 10:49,  3 users,  load average: 5.80, 5.71, 5.59
Tasks: 689 total,   6 running, 680 sleeping,   3 stopped,   0 zombie
%Cpu(s):  0.0 us, 42.2 sy,  0.0 ni, 57.4 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32800360 total, 14204412 free,   915028 used, 17680920 buff/cache
KiB Swap: 33554428 total, 33554428 free,        0 used. 14736192 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  269 root      20   0       0      0      0 R  99.0  0.0  47:30.23 arc_reclaim
  262 root      20   0       0      0      0 S  35.4  0.0  16:32.77 arc_prune
  267 root      20   0       0      0      0 S  35.4  0.0  16:39.98 arc_prune
  263 root      20   0       0      0      0 R  35.1  0.0  16:06.82 arc_prune
  264 root      20   0       0      0      0 R  35.1  0.0  16:22.74 arc_prune
  268 root      20   0       0      0      0 R  35.1  0.0  16:08.26 arc_prune
  261 root      20   0       0      0      0 S  34.8  0.0  16:27.70 arc_prune
  265 root      20   0       0      0      0 R  34.8  0.0  16:14.10 arc_prune
  266 root      20   0       0      0      0 S  34.8  0.0  16:18.26 arc_prune
 7165 root      20   0  158372   3580   2332 R   0.7  0.0   0:00.71 top
    7 root      20   0       0      0      0 S   0.3  0.0   0:53.41 rcu_sched
 1364 root       1 -19       0      0      0 S   0.3  0.0   0:02.37 z_wr_iss
 7184 root      20   0       0      0      0 S   0.3  0.0   0:00.03 kworker/6:1
    1 root      20   0   41572   4208   2496 S   0.0  0.0   0:03.20 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.03 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.35 ksoftirqd/0

@jonathanvaughn
Author

The pool status

  pool: REDACTED_pool0
 state: ONLINE
  scan: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        REDACTED_pool0                             ONLINE       0     0     0
          mirror-0                                 ONLINE       0     0     0
            ata-ST2000DM001-1CH164_S1E1QXHZ-part2  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_S1E1R42C-part2  ONLINE       0     0     0

errors: No known data errors

@dweeezil
Contributor

dweeezil commented Jun 5, 2016

@jonathanvaughn Your system is having trouble keeping the metadata under the limit, and it's not showing much evictable memory. Try setting the tunable zfs_arc_meta_strategy to zero and see if the traditional metadata-only adjuster doesn't work better.
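(A minimal sketch of how that might be changed at runtime, assuming the parameter is exposed under /sys/module/zfs/parameters as on stock ZoL 0.6.5:)

# cat /sys/module/zfs/parameters/zfs_arc_meta_strategy     # 1 = balanced (default), 0 = traditional metadata-only adjuster
# echo 0 > /sys/module/zfs/parameters/zfs_arc_meta_strategy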

@jonathanvaughn
Author

This seems to be an improvement.

So far I was able to complete an rsync of one dataset (the remote machine isn't ZFS, so this isn't ZFS send/recv) without it locking up. arc_reclaim is still using about 60% CPU, but I don't see a bunch of arc_prune on top of it, and the system is still responsive with a load of only ~1. ARC size was just a bit over the limit (8.04 GB vs 8 GB). I've started another rsync of another dataset and the ARC is still growing beyond the limit (currently 9.21 GB vs the 8 GB limit), but the system is responsive thus far and otherwise "working as expected".

@jonathanvaughn
Author

jonathanvaughn commented Jun 5, 2016

After those two rsyncs (the total files + directories in these datasets are over 7.3 million), ARC size is 14.24 GB vs the 8 GB limit set. After setting the datasets to primarycache=metadata, rebooting, and doing all of this again, the result is the same: over 14 GB of ARC usage even though the limit is set to 8 GB.

So presumably there is a problem preventing metadata from being evicted from the ARC to stay within the size limit, while data blocks do get evicted to make room?

@dweeezil
Contributor

dweeezil commented Jun 6, 2016

@jonathanvaughn You might be better off with the default "balanced" strategy. There are a couple of parameters it uses: zfs_arc_meta_adjust_restarts (default 4096) controls the number of passes made over data & metadata and likely accounts for all the CPU being used by the arc_reclaim thread. The zfs_arc_meta_prune parameter (default 10000) is the number of objects it scans, and it's doubled on every other pass. Try lowering zfs_arc_meta_adjust_restarts to maybe 4 or 8 and increasing zfs_arc_meta_prune to something much larger, maybe 100000, 500000 or even 1000000, and see if it doesn't work better.
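(A hedged sketch of applying those suggestions at runtime via the usual module-parameter interface; to persist across reboots they would instead go in something like /etc/modprobe.d/zfs.conf as "options zfs ..." lines:)

# echo 8 > /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts    # default 4096; far fewer passes per reclaim cycle
# echo 500000 > /sys/module/zfs/parameters/zfs_arc_meta_prune         # default 10000; scan many more objects per pass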

@jonathanvaughn
Author

Trying right now with restarts at 8 and prune at 100000. So far this seems to be keeping arc_reclaim from going crazy.

Is there a technical reason why there isn't some kind of built-in abort for the reclaim loop if it's taking more than some sane amount of time to finish reclaiming?

Either way, if this solves it, great. Eating ~50% of RAM for this server would be not great but survivable (just); however, the next server I had on my plate to upgrade the OS on and move to ZFS has ~27 million files vs ~7 million, and assuming a linear increase in the RAM required for metadata, it wouldn't even fit in memory. D:

@jonathanvaughn
Author

Looks like I spoke too soon / didn't wait long enough. I am going to continue to fiddle with those two settings though and see if some combination works out.

@dweeezil
Contributor

dweeezil commented Jun 6, 2016

@jonathanvaughn There hasn't been a report of runaway metadata in a while, so I'm wondering if there's anything else special about your setup. I ran a few of my normal tests and wasn't able to duplicate the situation. I'm going to start by switching to a 4.6 kernel, since my main test system is running 4.4.6 at the moment, to make sure it's not something with the kernel. Could you, for example, be using xattrs extensively? Are you using --fake-super?

In balanced mode, it tries to unpin metadata from the arc_prune thread by calling into the Linux superblock shrinker (super_cache_scan()), requesting scans of multiples of zfs_arc_meta_prune objects. You can use egrep '^(inode_cache|dentry)' /proc/slabinfo to get an idea of how successful it is.

@dweeezil
Contributor

dweeezil commented Jun 6, 2016

@jonathanvaughn In my initial tests on a 4.6 kernel, it looks like I was able to duplicate the problem. I'm going back to 4.4.6 to make sure it really didn't happen there.

@jonathanvaughn
Author

So the reason I didn't initially have total success was that I forgot to change the strategy back, and after I did so I was unable to reboot to clear the ARC (because our users were coming online). However, over the last ~12 hours those ZFS parameter changes did work, and the ARC size is now staying at the ARC maximum. I guess that because there were already some arc_reclaim runs using the old strategy, it took a long time to finally switch over, but once it did, things have been working as expected.

We didn't have any issues on the previous servers, which actually had more data, but they were databases so the file count was far lower (a few large files vs many small ones).

I guess we can close this?

@jonathanvaughn
Author

jonathanvaughn commented Jun 6, 2016

For whatever reason GitHub hadn't refreshed, so I didn't see your last updates.

I don't think we're using xattrs extensively, but we are using the xattr=sa setting. I am not aware of any specific uses of xattrs, but there might be some projects that have used them (so there could be some tens or hundreds of thousands of files with xattrs, out of the millions).

I wasn't using --fake-super (I was running rsync as root on both ends).

# egrep '^(inode_cache|dentry)' /proc/slabinfo
inode_cache        19316  19516    568   28    4 : tunables    0    0    0 : slabdata    697    697      0
dentry            4122640 4126290    192   21    1 : tunables    0    0    0 : slabdata 196490 196490      0

Not sure how relevant the slabinfo is currently since the problem is "fixed".

The current settings are zfs_arc_meta_adjust_restarts = 4 and zfs_arc_meta_prune = 500000.

I will try relaxing those to 8 and 100000 later, which is where I started, but since I already had ARC data exceeding the ARC size limit, it took time for arc_reclaim to pick up the new settings, and I kept changing things ...

@dweeezil
Contributor

dweeezil commented Jun 6, 2016

@jonathanvaughn This is a regression caused by changes between the 4.5 and 4.6 kernel. I'm working on a patch.

@dweeezil
Contributor

dweeezil commented Jun 6, 2016

The problem appears to be the continuing evolution of memory cgroups (memcg). If you boot with cgroup_disable=memory, the reclaiming should start working again. I haven't worked up a patch yet.
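(On CentOS 7 with GRUB2, a hedged sketch of adding that boot parameter to every installed kernel; adjust as needed for your bootloader setup:)

# grubby --update-kernel=ALL --args="cgroup_disable=memory"   # append the option to each kernel's command line
# reboot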

@jonathanvaughn
Author

I will try to test this in a few hours when we have no active users of that system.

@jonathanvaughn
Author

jonathanvaughn commented Jun 7, 2016

I am starting the test now on the existing server with the cgroup_disable=memory setting.

Also, I'm building yet another new server, and this one so far isn't having any problems even without making those ZFS parameter changes. The only differences are: hard drives (3 TB vs 2 TB, and a different brand), which seems unlikely to be related; CPU (the previous one was an AMD FX-8320, the one I'm building now is an FX-8350); and kernel version (the new server is on kernel 4.6.1 vs 4.6.0 on the previous one, since that is now the latest kernel-ml package from ELRepo).

It may be that whatever was "broken" in 4.6.0 has been "fixed" in 4.6.1. I'm letting things run for a while on both servers to try to verify that (A) cgroup_disable=memory fixes kernel 4.6.0 and (B) kernel 4.6.1 works without any special changes (either kernel boot args or ZFS parameters).

@jonathanvaughn
Author

cgroup_disable=memory definitely solves the problem under 4.6.0. I am testing that same machine on 4.6.1 now, since 4.6.1 is working on the other machine without any special configuration.

@jonathanvaughn
Author

Upgrading to 4.6.1 solved it for the machine I originally posted this issue about.

So it looks like whatever "they" broke in kernel 4.6.0, "they" fixed in kernel 4.6.1. As far as I'm personally concerned, then, there's no need to make a patch for 4.6.0 to solve this.

@dweeezil
Contributor

dweeezil commented Jun 8, 2016

The problem is caused by the continued development and integration of memory cgroups. The key commit which caused the ZoL shrinker to stop working is torvalds/linux@b313aee.

The obvious fix is to make the ZoL shrinker memcg-aware, similarly to the way in which it was made NUMA-aware. Unfortunately, this doesn't seem possible as far as I can tell, because mem_cgroup_iter() isn't exported, so some other hackish solution may be required.

I'll note, too, that this is certainly not "fixed" in 4.6.1 mainly because nothing is actually broken.

@jonathanvaughn
Author

jonathanvaughn commented Jun 8, 2016

Well, I've repeatedly churned through 3x as much data as it previously took to make the ARC grow uncontrollably, without any further issues, so the problem is at least no longer presenting itself (yet) under the same circumstances as before.

@dweeezil
Contributor

dweeezil commented Jun 8, 2016

@jonathanvaughn I can't explain why 4.6.1 would work OK. I took a pretty good look at all 101 commits between 4.6 and 4.6.1 and there doesn't appear to be anything that would impact memory cgroups at all. Does your 4.6.1 system still mount the memory cgroup at /sys/fs/cgroup/memory? Does its arc_meta_used always stay well below arc_meta_limit and not overshoot by much?
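(A hedged way to keep an eye on that from the shell, reading the same kstat file quoted earlier in this thread:)

# awk '/^arc_meta_(used|limit)/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats    # re-run periodically, or wrap in watch(1)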

@jonathanvaughn
Author

It overshoots occasionally but not abnormally - i.e., by a few percent - and then goes back to the limit.

/sys/fs/cgroup/memory/

total 0
-rw-r--r--   1 root root 0 Jun  7 05:20 cgroup.clone_children
--w--w--w-   1 root root 0 Jun  7 05:20 cgroup.event_control
-rw-r--r--   1 root root 0 Jun  7 05:20 cgroup.procs
-r--r--r--   1 root root 0 Jun  7 05:20 cgroup.sane_behavior
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.failcnt
--w-------   1 root root 0 Jun  7 05:20 memory.force_empty
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.failcnt
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.limit_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.max_usage_in_bytes
-r--r--r--   1 root root 0 Jun  7 05:20 memory.kmem.slabinfo
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.tcp.failcnt
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.tcp.limit_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.tcp.max_usage_in_bytes
-r--r--r--   1 root root 0 Jun  7 05:20 memory.kmem.tcp.usage_in_bytes
-r--r--r--   1 root root 0 Jun  7 05:20 memory.kmem.usage_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.limit_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.max_usage_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.memsw.failcnt
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.memsw.limit_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.memsw.max_usage_in_bytes
-r--r--r--   1 root root 0 Jun  7 05:20 memory.memsw.usage_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.move_charge_at_immigrate
-r--r--r--   1 root root 0 Jun  7 05:20 memory.numa_stat
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.oom_control
----------   1 root root 0 Jun  7 05:20 memory.pressure_level
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.soft_limit_in_bytes
-r--r--r--   1 root root 0 Jun  7 05:20 memory.stat
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.swappiness
-r--r--r--   1 root root 0 Jun  7 05:20 memory.usage_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.use_hierarchy
-rw-r--r--   1 root root 0 Jun  7 05:20 notify_on_release
-rw-r--r--   1 root root 0 Jun  7 05:20 release_agent
drwxr-xr-x 137 root root 0 Jun  7 05:20 system.slice
-rw-r--r--   1 root root 0 Jun  7 05:20 tasks
drwxr-xr-x   2 root root 0 Jun  7 05:20 user.slice

@jonathanvaughn
Author

For what it's worth, I've built yet another machine today, and the kernel-ml package is now 4.6.2. It also appears to work fine so far (holding at no more than a fraction of a percent over the ARC limit).

@jonathanvaughn
Author

Well, one of the servers (currently on 4.6.1) just had this issue again, though I don't know why. It wasn't seeing any higher level of IO than normal, nowhere near what happened originally to cause this issue. I did notice in arcstat.py that 'c' was below the 8 GB I set (7.6 GB) and that arcsz was around 8 GB. There was almost 8 GB of free RAM, so there shouldn't have been any memory pressure forcing 'c' below the 8 GB I set it to. I was (eventually) able to log in and set the strategy to 0, which immediately caused load to drop, but arcsz is going up past 8 GB. I set restarts to 4 and prune to 500000 and set the strategy back to 1 ... hopefully arcsz stops growing and drops back to the 8 GB max I set.

@dweeezil
Contributor

This problem is difficult to track down because it depends on the place in the memcg hierarchy at which the process performing the allocations resides. I'm running my test program as a normal logged-in user, and on my Ubuntu 14.04 system my shell appears in the hierarchy at .../memory/user/<uid>.user/<sessno>.session. When my shell, and therefore the testing programs, are in this deeper part of the hierarchy, the reclaim doesn't happen. If I move the shell to the root of the hierarchy with echo <pid> >> /sys/fs/cgroup/memory/tasks and run the test, the reclaim works perfectly fine.

ZoL is calling the generic superblock shrinker with a NULL memcg, and this is what's stopped working in the newer kernels due to other memcg-related changes. Unfortunately, the support necessary to make ZoL's shrinker wrapper memcg-aware is either not exported from the kernel or is exported GPL-only.
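(A hedged sketch of seeing where a process sits in that hierarchy and moving a shell to the root of it, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory:)

# grep memory /proc/$$/cgroup             # show this shell's position in the memory cgroup hierarchy
# echo $$ > /sys/fs/cgroup/memory/tasks   # move this shell to the root of the memory hierarchy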

@jonathanvaughn
Author

Since there's no telling when these issues will be resolved (either by finding a workaround on ZFS's side or by providing access to the necessary internals on the kernel side), I guess I'll need to downgrade kernels on these machines.

4.4.13 should be safe? It's the latest in the elrepo-kernel repo as kernel-lt; kernel-ml doesn't appear to have any older versions.
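(For reference, a hedged sketch of installing the long-term-support kernel from ELRepo on CentOS 7; the exact version pulled in depends on the repo at the time, and the 4.4.x entry still has to be selected in GRUB, e.g. with grub2-set-default, before rebooting:)

# yum --enablerepo=elrepo-kernel install kernel-lt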

@dweeezil
Contributor

There really doesn't seem to be a good way of making the shrinker memcg-aware given the interfaces exported from, at least, the 4.6 series of kernels so I think the best solution for now is to fall back to the older d_prune_aliases() scheme if necessary. The patch in f7d22c4 seems to work just fine.

dweeezil added a commit to dweeezil/zfs that referenced this issue Jun 17, 2016
As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliases() method
in case the kernel's per-superblock shrinker is not able to free anything.

Fixes: openzfs#4726
@behlendorf behlendorf added this to the 0.7.0 milestone Jun 17, 2016
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Jun 21, 2016
As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliases() method
in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes: openzfs#4726
sempervictus pushed a commit to sempervictus/zfs that referenced this issue Jun 26, 2016
dweeezil added a commit to dweeezil/zfs that referenced this issue Jul 13, 2016
GeLiXin added a commit to GeLiXin/zfs that referenced this issue Aug 1, 2016
* Consistently use parsable instead of parseable

This is a purely cosmetic change, to consistently prefer one of
two (both acceptable) choices for the word parsable in documentation and
code. I don't really care which to use, but according to wiktionary
https://en.wiktionary.org/wiki/parsable#English parsable is preferred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4682

* Add missing RPM BuildRequires

Both libudev and libattr are recommended build requirements.  As
such their development headers should be listed in the rpm spec file
so those dependencies are pulled in when building rpm packages.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4676

* Skip ctldir znode in zfs_rezget to fix snapdir issues

Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This
will cause funny behaviour for the mounted snapdirs. Especially for
Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone
from automounting it again as long as someone is still using the detached mount.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4514
Closes #4661
Closes #4672

* Improve zfs-module-parameters(5)

Various rewrites to the descriptions of module parameters. Corrects
spelling mistakes, makes descriptions more user-friendly and
describes some ZFS quirks which should be understood before changing
parameter values.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4671

* Fix arc_prune_task use-after-free

arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent
the underlying zsb from disappearing if there's a concurrent umount. We fix
this by forcing the caller of arc_remove_prune_callback to wait for
arc_prune_taskq to finish.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4687
Closes #4690

* Add request size histograms (-r) to zpool iostat, minor man page fix

Add -r option to "zpool iostat" to print request size histograms for the leaf
ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs
("agg"). These stats can be useful for seeing how well the ZFS IO aggregator
is working.

$ zpool iostat -r
mypool        sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0    530      0      0      0
1K              0      0    260      0      0      0    116    246      0      0
2K              0      0      0      0      0      0      0    431      0      0
4K              0      0      0      0      0      0      3    107      0      0
8K             15      0     35      0      0      0      0      6      0      0
16K             0      0      0      0      0      0      0     39      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K            20      0     40      0      0      0      0      0      0      0
128K            0      0     20      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0    155     19      0      0
8M              0      0      0      0      0      0      0    811      0      0
16M             0      0      0      0      0      0      0     68      0      0
--------------------------------------------------------------------------------

Also rename the stray "-G" in the man page to be "-w" for latency histograms.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #4659

* OpenZFS 6531 - Provide mechanism to artificially limit disk performance

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
  to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files

* Systemd configuration fixes

* Disable zfs-import-scan.service by default.  This ensures that
pools will not be automatically imported unless they appear in
the cache file.  When this service is explicitly enabled pools
will be imported with the "cachefile=none" property set.  This
prevents the creation of, or update to, an existing cache file.

    $ systemctl list-unit-files | grep zfs
    zfs-import-cache.service                  enabled
    zfs-import-scan.service                   disabled
    zfs-mount.service                         enabled
    zfs-share.service                         enabled
    zfs-zed.service                           enabled
    zfs.target                                enabled

* Change services to dynamic from static by adding an [Install]
section and adding 'WantedBy' tags in favor of 'Requires' tags.
This allows for easier customization of the boot behavior.

* Start the zfs-import-cache.service after the root pivot so
the cache file is available in the standard location.

* Start the zfs-mount.service after the systemd-remount-fs.service
to ensure the root fs is writeable and the ZFS filesystems can
create their mount points.

* Change the default behavior to only load the ZFS kernel modules
in zfs-import-*.service or when blkid(8) detects a pool.  Users
who wish to unconditionally load the kernel modules must uncomment
the list of modules in /lib/modules-load.d/zfs.conf.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4325
Closes #4496
Closes #4658
Closes #4699

* Fix self-healing IO prior to dsl_pool_init() completion

Async writes triggered by a self-healing IO may be issued before the
pool finishes the process of initialization.  This results in a NULL
dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes().

George Wilson recommended addressing this issue by initializing the
passed `dsl_pool_t **` prior to dmu_objset_open_impl().  Since the
caller is passing the `spa->spa_dsl_pool` this has the effect of
ensuring it's initialized.

However, since this depends on the caller knowing they must pass
the `spa->spa_dsl_pool` an additional NULL check was added to
vdev_queue_max_async_writes().  This guards against any future
restructuring of the code which might result in dsl_pool_init()
being called differently.

Signed-off-by: GeLiXin <47034221@qq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4652

* Add isa_defs for MIPS

GCC for MIPS only defines _LP64 when 64-bit,
while _ILP32 is not defined when 32-bit.

Signed-off-by: YunQiang Su <syq@debian.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4712

* Fix out-of-bound access in zfs_fillpage

The original code will do an out-of-bounds access on pl[] during the last
iteration.

 ==================================================================
 BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs]
 Read of size 8 by task tmpfile/7850
 page:ffffea00017c6dc0 count:0 mapcount:0 mapping:          (null) index:0x0
 flags: 0xffff8000000000()
 page dumped because: kasan: bad access detected
 CPU: 3 PID: 7850 Comm: tmpfile Tainted: G           OE   4.6.0+ #3
  ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618
  ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8
  ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3
 Call Trace:
  [<ffffffff81635618>] dump_stack+0x63/0x8b
  [<ffffffff81313ee8>] kasan_report_error+0x528/0x560
  [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0
  [<ffffffff813144b8>] kasan_report+0x58/0x60
  [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffff81312e4e>] __asan_load8+0x5e/0x70
  [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs]

  [<ffffffff81353c3a>] SyS_execve+0x3a/0x50
  [<ffffffff810058ef>] do_syscall_64+0xef/0x180
  [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25
 Memory state around the buggy address:
  ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4
                                                                 ^
  ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ==================================================================

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4705
Issue #4708

* Fix memleak in zpl_parse_options

strsep() will advance tmp_mntopts, and will change it to NULL on the
last iteration.  This will cause strfree(tmp_mntopts) to not free
anything.

unreferenced object 0xffff8800883976c0 (size 64):
  comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s)
  hex dump (first 32 bytes):
    72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a  rw.strictatime.z
    66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d  fsutil.mntpoint=
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811f9cac>] __kmalloc+0x16c/0x250
    [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl]
    [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs]
    [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs]
    [<ffffffff81222dc8>] mount_fs+0x38/0x160
    [<ffffffff81240097>] vfs_kern_mount+0x67/0x110
    [<ffffffff812428e0>] do_mount+0x250/0xe20
    [<ffffffff812437d5>] SyS_mount+0x95/0xe0
    [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4706
Issue #4708

* Fix memleak in vdev_config_generate_stats

fnvlist_add_nvlist will copy the contents of nvx, so we need to
free it here.

unreferenced object 0xffff8800a6934e80 (size 64):
  comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s)
  hex dump (first 32 bytes):
    60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff  `..s.....|.s....
    00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff  ........@.p.....
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310
    [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl]
    [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl]
    [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
    [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair]
    [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair]
    [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair]
    [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs]
    [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs]
    [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs]
    [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs]
    [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs]
    [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs]
    [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0
    [<ffffffff812333b9>] SyS_ioctl+0x79/0x90

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4707
Issue #4708

* Linux 4.7 compat: handler->set() takes both dentry and inode

Counterpart to fd4c7b7, the same approach was taken to resolve
the compatibility issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4717 
Issue #4665

* Implementation of AVX2 optimized Fletcher-4

New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.

New zcommon module parameters:
-  zfs_fletcher_4_impl (str): selects the implementation to use.
    "fastest" - use the fastest version available
    "cycle"   - cycle trough all available impl for ztest
    "scalar"  - use the original version
    "avx2"    - new AVX2 implementation if available

Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar:  4216 MB/s
- AVX2:   14499 MB/s

See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
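
As a hedged illustration (assuming the zcommon module is loaded and the
host supports AVX2), the selection can be inspected and changed through
the parameter file noted above:

    $ cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl
    # echo avx2 > /sys/module/zcommon/parameters/zfs_fletcher_4_impl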

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4330

* Fix cstyle.pl warnings

As of perl v5.22.1 the following warnings are generated:

* Redundant argument in printf at scripts/cstyle.pl line 194

* Unescaped left brace in regex is deprecated, passed through
  in regex; marked by <-- HERE in m/\S{ <-- HERE / at
  scripts/cstyle.pl line 608.

They have been addressed by escaping the left braces and by
providing the correct number of arguments to printf based on
the fmt specifier set by the verbose option.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4723

* Fix minor spelling mistakes

Trivial spelling mistake fix in error message text.

* Fix spelling mistake "adminstrator" -> "administrator"
* Fix spelling mistake "specificed" -> "specified"
* Fix spelling mistake "interperted" -> "interpreted"

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4728

* Add `zfs allow` and `zfs unallow` support

ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands.  In addition, non-
privileged users should be able to run all of the following commands:

  * zpool [list | iostat | status | get]
  * zfs [list | get]

Historically this functionality was not available on Linux.  In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability.  Only then could the permissions on
the `/dev/zfs` device be relaxed and the internal ZFS permission checks
used.

Even with this change some limitations remain.  Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace).  This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code.  It
may be possible to add this functionality in the future.
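
A hypothetical example of delegating and revoking a few of the supported
permissions (the user and dataset names are illustrative):

    # zfs allow alice create,snapshot,destroy tank/home
    # zfs allow tank/home
    # zfs unallow alice create tank/home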

This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite.  These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.

Two minor bug fixes were required for test-running.py.  First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it.  Second,
when running as a normal user, also check for scripts using both
the .ksh and .sh suffixes.

Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #362 
Closes #434 
Closes #4100
Closes #4394 
Closes #4410 
Closes #4487

* Remove libzfs_graph.c

The libzfs_graph.c source file should have been removed in 330d06f,
it is entirely unused.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4766

* Linux 4.6 compat: Fall back to d_prune_aliases() if necessary

As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliases() method
in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes: #4726

* SIMD implementation of vdev_raidz generate and reconstruct routines

This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.

Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
  module load, the parameter will only accept the first 3 options; the
  other implementations can be set once the module has finished
  loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
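
As a hedged sketch (assuming the implementation is supported on the
host), the routine can be inspected and selected at runtime through the
parameter file noted above:

    $ cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
    # echo avx2 > /sys/module/zfs/parameters/zfs_vdev_raidz_impl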

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4328

* Fix NFS credential

The commit f74b821 caused a regression where creating a file through NFS
will always create a file owned by root. This is because the patch enables
the KSID code in zfs_acl_ids_create, which uses the euid and egid of the
current process. However, on Linux, we should use fsuid and fsgid for file
operations, which is the original behaviour. So we revert this part of the
code.

The patch also enables secpolicy_vnode_*; since they are also used in file
operations, we change them to use fsuid and fsgid.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4772
Closes #4758

* OpenZFS 6513 - partially filled holes lose birth time

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6513
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0

If a ZFS object contains a hole at level one, and then a data block is
created at level 0 underneath that l1 block, l0 holes will be created.
However, these l0 holes do not have the birth time property set; as a
result, incremental sends will not send those holes.

Fix is to modify the dbuf_read code to fill in birth time data.

* Add a test case for dmu_free_long_range() to ztest

Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4754

* Revert "Add a test case for dmu_free_long_range() to ztest"

This reverts commit d0de2e82df579f4e4edf5643b674a1464fae485f which
introduced a new test case to ztest which is failing occasionally
during automated testing.  The change is being reverted until
the issue can be fully investigated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4754

* OpenZFS 6878 - Add scrub completion info to "zpool history"

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Authored by: Nav Ravindranath <nav@delphix.com>
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6878
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5
Closes #4787

* FreeBSD rS271776 - Persist vdev_resilver_txg changes

Persist vdev_resilver_txg changes to avoid panic caused by validation
vs a vdev_resilver_txg value from a previous resilver.

Authored-by: smh <smh@FreeBSD.org>
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5154
FreeBSD-issue: https://reviews.freebsd.org/rS271776
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf
Closes #4790

* xattrtest: allow verify with -R and other improvements

- Use a fixed buffer of random bytes when random xattr values are in
  effect.  This eliminates the potential performance bottleneck of
  reading from /dev/urandom for each file. This also allows us to
  verify xattrs in random value mode.

- Show the rate of operations per second in addition to elapsed time
  for each phase of the test. This may be useful for benchmarking.

- Set default xattr size to 6 so that verify doesn't fail if user
  doesn't specify a size. We need at least six bytes to store the
  leading "size=X" string that is used for verification.

- Allow user to execute just one phase of the test. Acceptable
  values for -o and their meanings are:

   1 - run the create phase
   2 - run the setxattr phase
   3 - run the getxattr phase
   4 - run the unlink phase

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Backfill metadnode more intelligently

Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates.  As summarized by
@mahrens in #4636:

"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects).  When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.

The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."

Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limiting the rescan to at most once per TXG.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Implement large_dnode pool feature

Justification
-------------

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks.  Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided.  Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.  Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
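
For example, a hedged sketch of setting a literal dnode size and
verifying it, reusing the dataset from the example above:

    # zfs set dnodesize=2k tank/fish
    # zfs get dnodesize tank/fish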

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The slots parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by a large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number.  This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.

ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
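
As a hedged illustration (pool and snapshot names are hypothetical), a
receive into a pool lacking the large_dnode feature is expected to fail
gracefully with an unsupported-feature error rather than producing a
corrupt dataset:

    # zfs send tank/fish@snap | zfs receive oldpool/fish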

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.

Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542

* Sync DMU_BACKUP_FEATURE_* flags

Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING.  The
DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and
then reserved in the upstream OpenZFS implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #4795

* OpenZFS 2605, 6980, 6902

2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12

6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f

Porting notes:
- All rsend and snapshot tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
  'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
  this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
  rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.

* OpenZFS 6051 - lzc_receive: allow the caller to read the begin record

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6051
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322

* OpenZFS 6393 - zfs receive a full send as a clone

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6394
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e

* OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS

Authored by: Andrew Stormont <astormont@racktopsystems.com>
Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com>
Reviewed by: Kim Shrier <kshrier@racktopsystems.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6536
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b

* OpenZFS 6738 - zfs send stream padding needs documentation

Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6738
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff

* OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota

Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/4986
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad

* OpenZFS 6562 - Refquota on receive doesn't account for overage

Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6562
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6

* Implement zfs_ioc_recv_new() for OpenZFS 2605

Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy
ZFS_IOC_RECV user/kernel interface.  The new interface supports all
stream options but is currently only used for resumable streams.
This way updated user space utilities will interoperate with older
kernel modules.

ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW
handler.  Non-Linux OpenZFS platforms have opted to change the
legacy interface in an incompatible fashion instead of adding a
new ioctl.
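
From user space, a resumable receive (the functionality these ioctls
support) looks roughly like the following hedged sketch, with
hypothetical pool and dataset names; `zfs receive -s` saves a
receive_resume_token when interrupted and `zfs send -t` resumes from it:

    # zfs send tank/fish@snap | zfs receive -s backup/fish
    # zfs get -H -o value receive_resume_token backup/fish
    # zfs send -t <token> | zfs receive -s backup/fish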

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* OpenZFS 6314 - buffer overflow in dsl_dataset_name

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6314
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee

* OpenZFS 6876 - Stack corruption after importing a pool with a too-long name

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking
for trouble. We should check every dataset on import, using a 1024 byte
buffer and checking each time to see if the dataset's new name is longer
than 256 bytes.

OpenZFS-issue: https://www.illumos.org/issues/6876
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e

* Vectorized fletcher_4 must be 128-bit aligned

The fletcher_4_native() and fletcher_4_byteswap() functions may only
safely use the vectorized implementations when the buffer is 128-bit
aligned.  This is because both the AVX2 and SSE implementations process
four 32-bit words per iteration.  Fall back to the scalar implementation
which only processes a single 32-bit word for unaligned buffers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4330

* Allow building with `CFLAGS="-O0"`

If compiled with -O0, gcc doesn't do any stack frame coalescing
and -Wframe-larger-than=1024 is triggered in debug mode.
Starting with gcc 4.8, the new optimization level -Og is available for
debugging, and it does not trigger this warning.
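
For instance, a hedged sketch of a debug build with optimization
disabled (configure options other than CFLAGS are illustrative):

    $ ./configure --enable-debug CFLAGS="-O0"
    $ make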

Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4799

* Don't allow accessing XATTR via export handle

Allowing access to XATTR through an export handle is a very bad idea.
It would allow a user to write whatever they want in fields where they
otherwise could not.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828

* Fix get_zfs_sb race with concurrent umount

Certain ioctl operations will call get_zfs_sb, which holds an active
count on the sb without checking whether it's active or not. This can
result in a use-after-free. We fix this by using atomic_inc_not_zero
to make sure we get an active sb.

P1                                          P2
---                                         ---
deactivate_locked_super(): s_active = 0
                                            zfs_sb_hold()
                                            ->get_zfs_sb(): s_active = 1
->zpl_kill_sb()
-->zpl_put_super()
--->zfs_umount()
---->zfs_sb_free(zsb)
                                            zfs_sb_rele(zsb)

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Fix Large kmem_alloc in vdev_metaslab_init

This allocation can go way over 1MB, so we should use vmem_alloc
instead of kmem_alloc.

  Large kmem_alloc(1430784, 0x1000), please file an issue...
  Call Trace:
   [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl]
   [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs]
   [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs]
   [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs]
   [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs]
   [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs]
   [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs]
   [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs]
   [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0
   [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4752

* Add configure result for xattr_handler

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828

* fh_to_dentry should return ESTALE when generation mismatch

When the generation doesn't match, it usually means the file pointed to by
the file handle was deleted. We should return ESTALE to indicate this. We
return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828

* xattr dir doesn't get purged during iput

We need to set inode->i_nlink to zero so iput will purge it. Without this,
it will only get purged during a cache shrink or umount, which would likely
result in deadlock due to zfs_zget waiting forever on its children, which
are in the dispose_list of the same thread.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Kill zp->z_xattr_parent to prevent pinning

zp->z_xattr_parent will pin the parent. This causes a huge issue when
unlinking a file with xattrs. Because the unlinked file is pinned, it
will never get purged immediately, and because of that, its xattrs will
never be marked as unlinked. So the whole unlinked state will stay there
until a cache shrink or umount.

This change partially reverts e89260a.  This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Fix RAIDZ_TEST tests

Remove stray trailing } which prevented the raidz stress tests from
running in-tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z

The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.

Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
  db_blkptr = NULL and it's dirtied.

Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
  xattr fits into the bonus buffer, so it's removed. The dbuf is
  undirtied in this txg, but it's still referenced and cannot be
  destroyed.

Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
  is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
  (not done yet).
* A new change makes the spill buffer necessary again.
  sa_build_layouts() ends up calling dbuf_find() to locate the
  dbuf.  It finds the old dbuf because it has not been destroyed yet
  (it will be destroyed when the previous write is done and there
  are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
  referenced, so it's not destroyed.

Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
  directly copied into the dnode, overwriting the blkptr area because,
  in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
  gets corrupted.

Signed-off-by: Peng <peng.hse@xtaotech.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3937

* Fix handling of errors nvlist in zfs_ioc_recv_new()

zfs_ioc_recv_impl() is changed to always allocate the 'errors'
nvlist; its callers are responsible for freeing it.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4829

* Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.

The patch covers low-end and older x86 CPUs.  Parity generation is
equivalent to the SSSE3 implementation, but reconstruction is somewhat
slower.  The previous 'sse' implementation is renamed to 'ssse3' to
indicate the highest instruction set used.

Benchmark results:
scalar_rec_p                    4    720476442
scalar_rec_q                    4    187462804
scalar_rec_r                    4    138996096
scalar_rec_pq                   4    140834951
scalar_rec_pr                   4    129332035
scalar_rec_qr                   4    81619194
scalar_rec_pqr                  4    53376668

sse2_rec_p                      4    2427757064
sse2_rec_q                      4    747120861
sse2_rec_r                      4    499871637
sse2_rec_pq                     4    522403710
sse2_rec_pr                     4    464632780
sse2_rec_qr                     4    319124434
sse2_rec_pqr                    4    205794190

ssse3_rec_p                     4    2519939444
ssse3_rec_q                     4    1003019289
ssse3_rec_r                     4    616428767
ssse3_rec_pq                    4    706326396
ssse3_rec_pr                    4    570493618
ssse3_rec_qr                    4    400185250
ssse3_rec_pqr                   4    377541245

original_rec_p                  4    691658568
original_rec_q                  4    195510948
original_rec_r                  4    26075538
original_rec_pq                 4    103087368
original_rec_pr                 4    15767058
original_rec_qr                 4    15513175
original_rec_pqr                4    10746357

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4783

* Enable zpool_upgrade test cases

Creating the pool in a striped rather than mirrored configuration
provides enough space for all upgrade tests to run.  Test case
zpool_upgrade_007_pos still fails and must be investigated so
it has been left disabled.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4852

* Prevent null dereferences when accessing dbuf kstat

In arc_buf_info(), the arc_buf_t may have no header.  If not, don't try
to fetch the arc buffer stats and instead just zero them.

The null dereferences were observed while accessing the dbuf kstat with
awk on a system in which millions of small files were being created in
order to overflow the system's metadata limit.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4837

* Fix dbuf_stats_hash_table_data race

Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf
can be freed at any time.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4846

* Use native inode->i_nlink instead of znode->z_links

A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's
64 bit on-disk link count.

We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a
more Linux-integrated fix for the same issue.

In addition, setting the initial link count on a new node has been changed
from setting one less than required in zfs_mknode() then incrementing to the
correct count in zfs_link_create() (which was somewhat bizarre in the first
place), to setting the correct count in zfs_mknode() and not incrementing it
in zfs_link_create(). This both means we no longer set the link count in
sa_bulk_update() twice (once for the initial incorrect count then again for
the correct count), as well as adhering to the Linux requirement of not
incrementing a zero link count without I_LINKABLE (see linux commit
f4e0c30c).

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4838
Issue #227

* Implementation of SSE optimized Fletcher-4

Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.

The module benchmark was also amended to analyze the performance of
the byteswapped version of Fletcher-4, as well as the non-byteswapped
version. The average performance of the two is used to select the
fastest implementation available on the host system.

Adds a pair of fields to an existing zcommon module parameter:
-  zfs_fletcher_4_impl (str)
    "sse2"    - new SSE2 implementation if available
    "ssse3"   - new SSSE3 implementation if available

Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4789

* Fix filesystem destroy with receive_resume_token

It is possible that the given DS may have hidden child (%recv)
datasets - "leftovers" resulting from a previously interrupted
'zfs receive'.  Try to remove the hidden child (%recv) and after
that try to remove the target dataset.  If the hidden child
(%recv) does not exist the original error (EEXIST) will be returned.

Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4818

* Prevent segfaults in SSE optimized Fletcher-4

In some cases, the compiler was not respecting the GNU aligned
attribute for stack variables in 35a76a0. This was resulting in
a segfault on CentOS 6.7 hosts using gcc 4.4.7-17.  This issue
was fixed in gcc 4.6.

To prevent this from occurring, use unaligned loads and stores
for all stack and global memory references in the SSE optimized
Fletcher-4 code.

Disable zimport testing against master where this flaw exists:

TEST_ZIMPORT_VERSIONS="installed"

Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4862

* Update arc_summary.py for prefetch changes

Commit 7f60329 removed several kstats which arc_summary.py read.
Remove these kstats from arc_summary.py in the same way this was
handled in FreeNAS.

FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4695

* Wait iput_async before evict_inodes to prevent race

Wait for iput_async before entering evict_inodes in
generic_shutdown_super. The reason we must finish before
evict_inodes is that when lazytime is on, or when zfs_purgedir calls
zfs_zget, iput would bump i_count from 0 to 1. This would race
with the i_count check in evict_inodes, which means it could
destroy the inode while we are still using it.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4854

* Fixes and enhancements of SIMD raidz parity

- Implementation lock replaced with atomic variable

- Trailing whitespace is removed from the user-specified parameter, to
improve the experience when using commands that add a newline, e.g. `echo`

- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813

- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392

- Minor fixes and cleanups

- Enable use of the original parity methods in the [fastest] configuration.
A new opaque original ops structure, representing the native methods, is
added to the supported raidz methods. The original parity methods are
executed if the selected implementation has a NULL fn pointer.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392

* RAIDZ parity kstat rework

Print table with speed of methods for each implementation.
Last line describes contents of [fastest] selection.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4860

* Fix NULL pointer in zfs_preumount from 1d9b3bd

When zfs_domount fails zsb will be freed, and its caller
mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into
zfs_preumount.

In order to make sure we don't touch anything that no longer exists, we
must make sure s_fs_info is NULL in the failure path so zfs_preumount can
easily check for that.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4867
Issue #4854

* Illumos Crypto Port module added to enable native encryption in zfs

A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329

* Fix for compilation error when using the kernel's CONFIG_LOCKDEP

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329

* zloop: print backtrace from core files

Find the core file by using `/proc/sys/kernel/core_pattern`

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4874

* Fix for metaslab_fastwrite_unmark() assert failure

Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4267

* Remove znode's z_uid/z_gid member

Remove duplicate z_uid/z_gid member which are also held in the
generic vfs inode struct. This is done by first removing the members
from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID
macros to access the respective member from struct inode. In cases
where the uid/gids are being marshalled from/to disk, use the newly
introduced zfs_(uid|gid)_(read|write) functions to properly
save the uids rather than the internal kernel representation.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227

* Check whether the kernel supports i_uid/gid_read/write helpers

Since the concept of a kuid, and the need to translate from it to an
ordinary integer type, was added in kernel version 3.5, implement the
necessary plumbing to detect this condition at compile time. If the
kernel doesn't support kuids, just fall back to directly accessing the
respective struct inode members.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227

* Fix uninitialized variable in avl_add()

Silence the following warning when compiling with gcc 5.4.0.
Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.

module/avl/avl.c: In function ‘avl_add’:
module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized
    in this function [-Wmaybe-uninitialized]
  avl_insert(tree, new_node, where);

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Fix sync behavior for disk vdevs

Prior to b39c22b, which was first generally available in the 0.6.5
release, ZoL never actually submitted synchronous read or write
requests to the Linux block layer.  This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.

In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start().  The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b.  In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.

The original rationale for introducing synchronous operations in b39c22b
was to hurry certain requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency.  This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.

To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.

For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
is used for the same purpose.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4858

* Limit the amount of dnode metadata in the ARC

Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).

In order to help track metadata usage more precisely, the other_size
metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes.  Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
it tries to unpin 10% of the dnodes.
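
A hedged sketch of inspecting and adjusting these tunables at runtime
through the usual module parameter files (the 1 GiB value is purely
illustrative):

    $ cat /sys/module/zfs/parameters/zfs_arc_dnode_limit
    $ cat /sys/module/zfs/parameters/zfs_arc_dnode_reduce_percent
    # echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_dnode_limit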

The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):

    - Create a large number of small files until arc_meta_used exceeds
      arc_meta_limit (3GiB with default tuning) and arc_prune
      starts increasing.

    - Create a 3GiB file with dd.  Observe arc_meta_used.  It will still
      be around 3GiB.

    - Repeatedly read the 3GiB file and observe arc_meta_limit as before.
      It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made
available as subsequent demands on th…
nedbass pushed a commit to nedbass/zfs that referenced this issue Aug 26, 2016
As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliaes() method
in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes: openzfs#4726
@vladki77

vladki77 commented Nov 24, 2016

Hi guys, I have run into a similar problem with Debian 3.16.36-1+deb8u2, zfs-dkms 0.6.5.7-8-jessie.
I had to reboot as I could not access files on ZFS at all. The system itself was responsive, as / and /var are on ext4. However, rsync runs hourly and this happened for the first time only after several months, probably due to the recent reboot to get the new kernel and ZFS updates (from 0.6.5.6). Now the system is running fine (knock-knock).

I would like to ask whether you found a final solution - either by upgrading/downgrading or by tweaking some ZFS options. From reading the posts above I get the feeling that the fixes were mostly temporary and the problem came back after a while. I'm about to upgrade to jessie-backports (0.6.5.8-1~bpo8+1).

Another question is whether your problem was "deterministic", i.e. did every run of rsync cause problems, or only some of them?

Thanks in advance for any hints.
