arc_reclaim / arc_prune out of control after running long rsync #4726

Closed
jonathanvaughn opened this issue Jun 4, 2016 · 29 comments

@jonathanvaughn

This may be related to similar issues such as #4345 or #4239, but those didn't seem to quite match what I'm experiencing.

We've upgraded several machines recently from CentOS 6 to 7, and are using ZFS. I haven't seen this issue on all machines, only the most recent one, so I'm not sure what the issue is.

When performing a large rsync (many GB and hundreds of thousands of files, but not performing a checksum comparison, so it's mostly stat performance, not reads), the ARC fills up and the machine becomes nearly unresponsive due to arc_reclaim / arc_prune.

The dataset is set to primarycache=metadata, so I wouldn't have expected it to fill faster than it could be dealt with. I had the ARC max set to 8 GB (32 GB machine), and in order to get the machine responsive again I had to increase it to 12 GB. Even after doing so, and even with the machine idle, it's been well over half an hour and arc_reclaim / arc_prune are still running, even though there's more than 3 GB of free ARC at this point (it grew past 8 GB). I tried limiting the metadata size to see if that would help, but it didn't seem to matter.

Running kernel 4.6.0-1.el7.elrepo.x86_64 on CentOS 7, ZFS version v0.6.5.7-1 (the same as the other machines, which have had no issues).

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
19:49:56    82    53     64    51   63     2  100    53   64   8.7G   12G
ZFS Subsystem Report                            Fri Jun 03 19:49:22 2016
ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                570.61k
        Mutex Misses:                           5.58k
        Evict Skips:                            5.58k

ARC Size:                               70.46%  8.45    GiB
        Target Size: (Adaptive)         100.00% 12.00   GiB
        Min Size (Hard Limit):          4.17%   512.00  MiB
        Max Size (High Water):          24:1    12.00   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       50.21%  6.03    GiB
        Frequently Used Cache Size:     49.79%  5.97    GiB

ARC Hash Breakdown:
        Elements Max:                           412.63k
        Elements Current:               34.67%  143.05k
        Collisions:                             93.65k
        Chain Max:                              3
        Chains:                                 1.99k

ARC Total accesses:                                     31.60m
        Cache Hit Ratio:                82.35%  26.02m
        Cache Miss Ratio:               17.65%  5.58m
        Actual Hit Ratio:               49.50%  15.64m

        Data Demand Efficiency:         75.67%  1.47m
        Data Prefetch Efficiency:       63.21%  3.16k

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             39.65%  10.32m
          Most Recently Used:           14.17%  3.69m
          Most Frequently Used:         45.95%  11.96m
          Most Recently Used Ghost:     0.24%   62.46k
          Most Frequently Used Ghost:   0.00%   115

        CACHE HITS BY DATA TYPE:
          Demand Data:                  4.26%   1.11m
          Prefetch Data:                0.01%   2.00k
          Demand Metadata:              55.85%  14.53m
          Prefetch Metadata:            39.88%  10.38m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  6.39%   356.75k
          Prefetch Data:                0.02%   1.16k
          Demand Metadata:              91.17%  5.09m
          Prefetch Metadata:            2.42%   134.89k


File-Level Prefetch: (HEALTHY)
DMU Efficiency:                                 45.63m
        Hit Ratio:                      92.22%  42.08m
        Miss Ratio:                     7.78%   3.55m

        Colinear:                               3.55m
          Hit Ratio:                    0.04%   1.39k
          Miss Ratio:                   99.96%  3.55m

        Stride:                                 41.74m
          Hit Ratio:                    99.99%  41.74m
          Miss Ratio:                   0.01%   5.31k

DMU Misc:
        Reclaim:                                3.55m
          Successes:                    0.35%   12.56k
          Failures:                     99.65%  3.54m

        Streams:                                345.91k
          +Resets:                      0.01%   36
          -Resets:                      99.99%  345.87k
          Bogus:                                0


ZFS Tunable:
        metaslab_debug_load                               0
        zfs_arc_min_prefetch_lifespan                     0
        zfetch_max_streams                                8
        zfs_nopwrite_enabled                              1
        zfetch_min_sec_reap                               2
        zfs_dbgmsg_enable                                 0
        zfs_dirty_data_max_max_percent                    25
        zfs_arc_p_aggressive_disable                      1
        spa_load_verify_data                              1
        zfs_zevent_cols                                   80
        zfs_dirty_data_max_percent                        10
        zfs_sync_pass_dont_compress                       5
        l2arc_write_max                                   8388608
        zfs_vdev_scrub_max_active                         2
        zfs_vdev_sync_write_min_active                    10
        zvol_prefetch_bytes                               131072
        metaslab_aliquot                                  524288
        zfs_no_scrub_prefetch                             0
        zfs_arc_shrink_shift                              0
        zfetch_block_cap                                  256
        zfs_txg_history                                   0
        zfs_delay_scale                                   500000
        zfs_vdev_async_write_active_min_dirty_percent     30
        metaslab_debug_unload                             0
        zfs_read_history                                  0
        zvol_max_discard_blocks                           16384
        zfs_recover                                       0
        l2arc_headroom                                    2
        zfs_deadman_synctime_ms                           1000000
        zfs_scan_idle                                     50
        zfs_free_min_time_ms                              1000
        zfs_dirty_data_max                                3358756864
        zfs_vdev_async_read_min_active                    1
        zfs_mg_noalloc_threshold                          0
        zfs_dedup_prefetch                                0
        zfs_vdev_max_active                               1000
        l2arc_write_boost                                 8388608
        zfs_resilver_min_time_ms                          3000
        zfs_vdev_async_write_max_active                   10
        zil_slog_limit                                    1048576
        zfs_prefetch_disable                              0
        zfs_resilver_delay                                2
        metaslab_lba_weighting_enabled                    1
        zfs_mg_fragmentation_threshold                    85
        l2arc_feed_again                                  1
        zfs_zevent_console                                0
        zfs_immediate_write_sz                            32768
        zfs_dbgmsg_maxsize                                4194304
        zfs_free_leak_on_eio                              0
        zfs_deadman_enabled                               1
        metaslab_bias_enabled                             1
        zfs_arc_p_dampener_disable                        1
        zfs_object_mutex_size                             64
        zfs_metaslab_fragmentation_threshold              70
        zfs_no_scrub_io                                   0
        metaslabs_per_vdev                                200
        zfs_dbuf_state_index                              0
        zfs_vdev_sync_read_min_active                     10
        metaslab_fragmentation_factor_enabled             1
        zvol_inhibit_dev                                  0
        zfs_vdev_async_write_active_max_dirty_percent     60
        zfs_vdev_cache_size                               0
        zfs_vdev_mirror_switch_us                         10000
        zfs_dirty_data_sync                               67108864
        spa_config_path                                   /etc/zfs/zpool.cache
        zfs_dirty_data_max_max                            8396892160
        zfs_arc_lotsfree_percent                          10
        zfs_zevent_len_max                                128
        zfs_scan_min_time_ms                              1000
        zfs_arc_sys_free                                  0
        zfs_arc_meta_strategy                             1
        zfs_vdev_cache_bshift                             16
        zfs_arc_meta_adjust_restarts                      4096
        zfs_max_recordsize                                1048576
        zfs_vdev_scrub_min_active                         1
        zfs_vdev_read_gap_limit                           32768
        zfs_arc_meta_limit                                8589934592
        zfs_vdev_sync_write_max_active                    10
        l2arc_norw                                        0
        zfs_arc_meta_prune                                10000
        metaslab_preload_enabled                          1
        l2arc_nocompress                                  0
        zvol_major                                        230
        zfs_vdev_aggregation_limit                        131072
        zfs_flags                                         0
        spa_asize_inflation                               24
        zfs_admin_snapshot                                0
        l2arc_feed_secs                                   1
        zio_taskq_batch_pct                               75
        zfs_sync_pass_deferred_free                       2
        zfs_disable_dup_eviction                          0
        zfs_arc_grow_retry                                0
        zfs_read_history_hits                             0
        zfs_vdev_async_write_min_active                   1
        zfs_vdev_async_read_max_active                    3
        zfs_scrub_delay                                   4
        zfs_delay_min_dirty_percent                       60
        zfs_free_max_blocks                               100000
        zfs_vdev_cache_max                                16384
        zio_delay_max                                     30000
        zfs_top_maxinflight                               32
        spa_slop_shift                                    5
        zfs_vdev_write_gap_limit                          4096
        spa_load_verify_metadata                          1
        spa_load_verify_maxinflight                       10000
        l2arc_noprefetch                                  1
        zfs_vdev_scheduler                                noop
        zfs_expire_snapshot                               300
        zfs_sync_pass_rewrite                             2
        zil_replay_disable                                0
        zfs_nocacheflush                                  0
        zfs_arc_max                                       12884901888
        zfs_arc_min                                       536870912
        zfs_read_chunk_size                               1048576
        zfs_txg_timeout                                   5
        zfs_pd_bytes_max                                  52428800
        l2arc_headroom_boost                              200
        zfs_send_corrupt_data                             0
        l2arc_feed_min_ms                                 200
        zfs_arc_meta_min                                  0
        zfs_arc_average_blocksize                         8192
        zfetch_array_rd_sz                                1048576
        zfs_autoimport_disable                            1
        zfs_arc_p_min_shift                               0
        zio_requeue_io_start_cut_in_line                  1
        zfs_vdev_sync_read_max_active                     10
        zfs_mdcomp_disable                                0
        zfs_arc_num_sublists_per_state                    8
NAME                                                                                                USED  AVAIL  REFER  MOUNTPOINT
REDACTED_pool0                                                                                        663G  1.11T    96K  /REDACTED_pool0
REDACTED_pool0/ROOT                                                                                  2.18G  1.11T  2.18G  /REDACTED_pool0/ROOT
REDACTED_pool0/atlassian-data                                                                        68.0G  1.11T  68.0G  /var/atlassian
REDACTED_pool0/atlassian-opt                                                                         3.06G  1.11T  3.06G  /opt/atlassian
REDACTED_pool0/docker                                                                                3.93M  1.11T  3.93M  /var/docker
REDACTED_pool0/docker-storage                                                                         932M  1.11T  3.93M  /var/lib/docker
REDACTED_pool0/docker-storage/0056857a64482466f0237a0151573bf9d25978be72309cffd2469ff2a7c563db        356K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/00ac923f6cd89b24efa1b6e6c979f6742e1d7254d55d36e9a4714b2227cfe425        756K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/012e103c0fcc67724a911f0c498a021a56f68f9e81d8dff7efcfbfae98514329        148K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/0480ce2f1b482c7de707a34af5b5939ef5766d36e418eb0851255529186b9250        140K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/0baabef050bdca13b0d07a49fd2970e4fe539d7508f4baf1620322e2e52db0cb        132K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/106eb69bddb28b4cd151964833d657f35172b48a0806f8a08c9e812768295dc1         96K  1.11T   337M  legacy
REDACTED_pool0/docker-storage/184df186e387e4bc17ac9bc1533480b57410215edd76f59c99515baff97137c7        636K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/1a606111d2307b93d61263375b1d497f2f1448d0782fe605bd5dba3bc7c9a462        232K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/1e3826605d1afdbd80412f04346cdbea14259483e77c785afd760830417ab1b3        140K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/1feff74a8ca94e76ec99a5760dde6bd4bd4b2d5422414f572a4af1d65bba5571       5.77M  1.11T   377M  legacy
REDACTED_pool0/docker-storage/27edf3335bc0054d7f445ffa3f5f6ab9c0b4489f930dbe7d01cb102c1c2ebecb        140K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/2897462c8ec5deb33b59a63611d4b3ec0b0026235410aac961fbfa4d0413ca11        140K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/32b6044561f0a6021fd0ef6a3fb80f882ccd8b6430f84c3499327477422099f7        112K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/35c32fbbb9a659ad206f549b6c110044d69c0f3ebe4a10cf3ffa85054d7229b8        112K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/35eb4d2ea04eefa3cff4a64ac68916eeab8f75c899b9faf6e8b0b690241140fb        148K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/36ed577a48877e1a8efb8af3b197090c8b8a6c0920608cd9c1867d8ea9260ff0        156K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/3dc93a1d00d332f23afac47250baf0494704d60c2aeeee62852e99b843657b51       7.47M  1.11T   125M  legacy
REDACTED_pool0/docker-storage/3e4c71cebc543f819b35527786c04ae3499c9c5fabc62080e84b4bb5a66e4b03        104K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/3efa2aa1e0582899d6faaef860e4781f828626ab8bbbe978342a76d97fc82f68        140K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/445e327c7b229541de7860c31a096764cd6a420fd609d05572b3a4d5bb1edd0e        800K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/445e327c7b229541de7860c31a096764cd6a420fd609d05572b3a4d5bb1edd0e-init   176K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/46448c001a786f0c58148511ec8ec6dbb264e5bff981b1ed326f94d2e86bcd7c        356K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/47d44cb6f252ea4f6aecf8a447972de5d9f9f2e2bec549a2f1d8f92557f4d05a        104K  1.11T    96K  legacy
REDACTED_pool0/docker-storage/49ad0c973296a46ef7fcc70a24fe4640dfba286076e9302b5cdf828abdd13501        132K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/4e9f14010a5270b899c038a6e9cc9d4b6206422bc625f2470df41cec44a9cce0        232K  1.11T   398M  legacy
REDACTED_pool0/docker-storage/4f5c015cca28acd98b93a99c2d0695a0f6068a0b972a6d999901f715d44f4508       61.9M  1.11T   391M  legacy
REDACTED_pool0/docker-storage/511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158        104K  1.11T    96K  legacy
REDACTED_pool0/docker-storage/5156bebbccf9cc966f8fd16cf277340c609fe8970f68273dc52d4894e50498b1        185M  1.11T   362M  legacy
REDACTED_pool0/docker-storage/5764f0a3131791360948d70cc2714226a1ec786675d27e09348abd4adecb03ea        104K  1.11T   125M  legacy
REDACTED_pool0/docker-storage/594cd51330f70859dd1dd919f3f2d3ab89f0d5f55153eb829bb084524054eadb       5.75M  1.11T   399M  legacy
REDACTED_pool0/docker-storage/5ae46cd541a55793c3963478066abf7b9c2be42ab6864c8c1618e5021306c3b4        148K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/5b12ef8fd57065237a6833039acc0e7f68e363c15d8abb5cacce7143a1f7de8a         72K  1.11T    96K  legacy
REDACTED_pool0/docker-storage/5ba63a7ec5d003675f6c0f692807be7d131c395f35c89386285d807e1caedeca       4.49M  1.11T   393M  legacy
REDACTED_pool0/docker-storage/5d3dc7393d4ae5a85678ace6b8e1bfa7404c56199d94b7270c5978857d6bb8f6        156K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/60e65a8e4030022260a4f84166814b2683e1cdfc9725a9c262e90ba9c5ae2332        104K  1.11T   125M  legacy
REDACTED_pool0/docker-storage/6856d39a282fe617098075475f2a857a2a297645b91cfe4dc3bbf0fb25cca214        984K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/6856d39a282fe617098075475f2a857a2a297645b91cfe4dc3bbf0fb25cca214-init   176K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/6ae1c96016a58e20db8987812a10d452171b9376c38c043f17f420d498a23b17        112K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/6cec9a305daa7981fde96f6633443a3c78e4d8385fa08e7ab0e18d3377784d35        148K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/7080051af98969ef315b1579f289aa1a1dd99834922ae022a726d6468a8db880       2.36M  1.11T   372M  legacy
REDACTED_pool0/docker-storage/71098c77ff51f5f2d22f09d2913f2a933f92618d55ff4cd1f3e4c750de7e3cad        788K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/79403fcabcbce6f2f1f33183c05caeba3395601c42310b6cca49aa4ba79dd58c       38.7M  1.11T   389M  legacy
REDACTED_pool0/docker-storage/838c1c5c4f833fda62e928de401303d293d23d52c831407b12edd95ca3f1839e        125M  1.11T   125M  legacy
REDACTED_pool0/docker-storage/84407cfa9ecdb7463512414e4a8fe1a6fba690f0b71b9b52a18b0a72d3410742        104K  1.11T   125M  legacy
REDACTED_pool0/docker-storage/8b83cd0a5724dc8d4c6c74c7a14b546f58f6aab8eecdf8bff711f3f83a4a4d03        140K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/8f53bed55de21acc781996cbbf664e3fae213e575d1d13a8e491b5178396b3f7        164M  1.11T   337M  legacy
REDACTED_pool0/docker-storage/9ca1d808cb59542f9e1510bbb47e7e138c0e29908c0dfd5e7bb68e7d2b959644       1.06M  1.11T   377M  legacy
REDACTED_pool0/docker-storage/9ca1d808cb59542f9e1510bbb47e7e138c0e29908c0dfd5e7bb68e7d2b959644-init   232K  1.11T   377M  legacy
REDACTED_pool0/docker-storage/a6c4f4c57f963c985747361f7b96948161ee588feb680de40ad966fe8c2e65d4        104K  1.11T   377M  legacy
REDACTED_pool0/docker-storage/b1b7f0f37901c473c30ded439bdd84d39cd5750721873fa821d94493dfa695d4       71.6M  1.11T   187M  legacy
REDACTED_pool0/docker-storage/b3a2c44bb00bee64a0d0c42e8643646eb41df8cb38960b69ed318bc0533a72e4        112K  1.11T   133M  legacy
REDACTED_pool0/docker-storage/b448af7556ed98e73b8439965f7b53d527f1f178281e25ed6b4fe9e9d2e83564       1.86M  1.11T   393M  legacy
REDACTED_pool0/docker-storage/b5ced8b2946c34da022c8d7fe1fc57528ce6b15b3ad2fbed01c79f5e69d00d76        104K  1.11T   399M  legacy
REDACTED_pool0/docker-storage/ca5966735c535a0a9150496df6badd5da1187384b53b3debf28d0b00b80bceb5       5.80M  1.11T   398M  legacy
REDACTED_pool0/docker-storage/cbdb5fea83decdf0e9a3cd3e9495550fe5022abcd4b8c6b5620dddfe831df4de       4.71M  1.11T   133M  legacy
REDACTED_pool0/docker-storage/cc0cb24fc14d3ce6e75d6c4818632add7972bea82371f25a2fbefc6190ca2e16        148K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/ced21ad71cbe6b06e41d95192c4b27b330f37bdba0a5ae32df571115601d1a88        240K  1.11T   377M  legacy
REDACTED_pool0/docker-storage/d3bae8310b00855a43751fa7ed47ad35581cb32e5f6e61d3835bd16be86d8211        416K  1.11T   183M  legacy
REDACTED_pool0/docker-storage/d9ecc821c7ccba515c5040a33f0b6e8a9eb419775c9474687659f350022961df        112K  1.11T   362M  legacy
REDACTED_pool0/docker-storage/e005ab5515593115e9e4953fc3fbc5827f9182b28774b3055148eca822daf479       42.4M  1.11T   370M  legacy
REDACTED_pool0/docker-storage/e355356651cc59ae780a08a19ae45180d9ab62c1618652e7671dfbc27fe5148f        132K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/ed454ebfb7d17bd061730e68cd42028b3d26b9e341e5bd668b75b506e5ccb012        112K  1.11T   187M  legacy
REDACTED_pool0/docker-storage/f6808a3e4d9e80a655ec625e38b869ed8a614611e4d0073aeff23be841c9fcff        133M  1.11T   133M  legacy
REDACTED_pool0/docker-storage/f75d9f29b700f0dfa47012b53061affbdf47b2f111ea15d336bbcc0eba3c370a        112K  1.11T   362M  legacy
REDACTED_pool0/docker-storage/fb72f570f4b4da69d5903190750b1f6225b1547a705a4ab5bef8bcec9507a5c1       56.9M  1.11T   183M  legacy
REDACTED_pool0/share_REDACTED                                                                       3.56G  1.11T  3.56G  /usr/share/REDACTED
REDACTED_pool0/swap                                                                                  34.0G  1.14T    64K  -
REDACTED_pool0/www                                                                                    551G  1.11T  1.32M  /var/www
REDACTED_pool0/www/mediarepo-local                                                                   26.4G  1.11T  26.4G  /var/www/mediarepo-local
REDACTED_pool0/www/preview.REDACTED.com                                                              281G  1.11T   281G  /var/www/preview.REDACTED.com
REDACTED_pool0/www/qa.REDACTED.net                                                                   243G  1.11T   243G  /var/www/qa.REDACTED.net
@dweeezil
Contributor

dweeezil commented Jun 4, 2016

@jonathanvaughn Could you please post the full /proc/spl/kstat/zfs/arcstats when this is happening? I suspect there are other processes competing for memory. By default, on a 32 GiB system, ZoL is going to want to keep at least 512 MiB of free memory in the system. This behavior can be adjusted with the newish zfs_arc_sys_free tunable if necessary.
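(For reference, a minimal sketch of checking the relevant counters and raising the free-memory floor at runtime, assuming the standard ZoL module-parameter interface under /sys/module/zfs/parameters; the 1 GiB value is purely illustrative, not a recommendation:)

# grep -E '^(size|c_max|arc_meta_used|arc_meta_limit|arc_sys_free) ' /proc/spl/kstat/zfs/arcstats   # current ARC and metadata usage vs. limits
# echo $((1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_sys_free                            # keep ~1 GiB free for the rest of the system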

@jonathanvaughn
Author

jonathanvaughn commented Jun 4, 2016

I've tried turning primarycache=all on for most of the datasets in case it was some weird behavior with only metadata being cacheable, but it still happens.
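(For reference, flipping that is just a dataset property change; the dataset name below is taken from the listing above and is only an example:)

# zfs set primarycache=all REDACTED_pool0/www
# zfs get primarycache REDACTED_pool0/www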

Here's the output from free -m:

              total        used        free      shared  buff/cache   available
Mem:          32031         884       13880          51       17266       14399
Swap:         32767           0       32767

As you can see, there's plenty of free memory at the moment.

Here's the output you wanted:

6 1 0x01 91 4368 1579813499 38581696671350
name                            type data
hits                            4    21401257
misses                          4    3077974
demand_data_hits                4    119714
demand_data_misses              4    52075
demand_metadata_hits            4    12301614
demand_metadata_misses          4    2877192
prefetch_data_hits              4    7
prefetch_data_misses            4    409
prefetch_metadata_hits          4    8979922
prefetch_metadata_misses        4    148298
mru_hits                        4    4113558
mru_ghost_hits                  4    6889
mfu_hits                        4    8307770
mfu_ghost_hits                  4    126
deleted                         4    256144
mutex_miss                      4    796
evict_skip                      4    59340458
evict_not_enough                4    14941394
evict_l2_cached                 4    0
evict_l2_eligible               4    1496560128
evict_l2_ineligible             4    289325056
evict_l2_skip                   4    0
hash_elements                   4    135033
hash_elements_max               4    364571
hash_collisions                 4    18243
hash_chains                     4    1395
hash_chain_max                  4    3
p                               4    4401385984
c                               4    8589934592
c_min                           4    536870912
c_max                           4    8589934592
size                            4    8629517472
hdr_size                        4    57255552
data_size                       4    0
metadata_size                   4    2192923136
other_size                      4    6379338784
anon_size                       4    32768
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    2159458816
mru_evictable_data              4    0
mru_evictable_metadata          4    81920
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    33431552
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    1261
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    684374023
arc_meta_used                   4    8629517472
arc_meta_limit                  4    8589934592
arc_meta_max                    4    8629603384
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    524804096

Top:

top - 16:22:40 up 10:49,  3 users,  load average: 5.80, 5.71, 5.59
Tasks: 689 total,   6 running, 680 sleeping,   3 stopped,   0 zombie
%Cpu(s):  0.0 us, 42.2 sy,  0.0 ni, 57.4 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32800360 total, 14204412 free,   915028 used, 17680920 buff/cache
KiB Swap: 33554428 total, 33554428 free,        0 used. 14736192 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  269 root      20   0       0      0      0 R  99.0  0.0  47:30.23 arc_reclaim
  262 root      20   0       0      0      0 S  35.4  0.0  16:32.77 arc_prune
  267 root      20   0       0      0      0 S  35.4  0.0  16:39.98 arc_prune
  263 root      20   0       0      0      0 R  35.1  0.0  16:06.82 arc_prune
  264 root      20   0       0      0      0 R  35.1  0.0  16:22.74 arc_prune
  268 root      20   0       0      0      0 R  35.1  0.0  16:08.26 arc_prune
  261 root      20   0       0      0      0 S  34.8  0.0  16:27.70 arc_prune
  265 root      20   0       0      0      0 R  34.8  0.0  16:14.10 arc_prune
  266 root      20   0       0      0      0 S  34.8  0.0  16:18.26 arc_prune
 7165 root      20   0  158372   3580   2332 R   0.7  0.0   0:00.71 top
    7 root      20   0       0      0      0 S   0.3  0.0   0:53.41 rcu_sched
 1364 root       1 -19       0      0      0 S   0.3  0.0   0:02.37 z_wr_iss
 7184 root      20   0       0      0      0 S   0.3  0.0   0:00.03 kworker/6:1
    1 root      20   0   41572   4208   2496 S   0.0  0.0   0:03.20 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.03 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.35 ksoftirqd/0

@jonathanvaughn
Author

The pool status

  pool: REDACTED_pool0
 state: ONLINE
  scan: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        REDACTED_pool0                             ONLINE       0     0     0
          mirror-0                                 ONLINE       0     0     0
            ata-ST2000DM001-1CH164_S1E1QXHZ-part2  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_S1E1R42C-part2  ONLINE       0     0     0

errors: No known data errors

@dweeezil
Contributor

dweeezil commented Jun 5, 2016

@jonathanvaughn Your system is having trouble keeping the metadata under the limit, and it's not showing much evictable memory. Try setting the tunable zfs_arc_meta_strategy to zero and see if the traditional metadata-only adjuster doesn't work better.
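(A minimal sketch of how that might be changed at runtime, assuming the parameter is exposed under /sys/module/zfs/parameters as on stock ZoL 0.6.5:)

# cat /sys/module/zfs/parameters/zfs_arc_meta_strategy     # 1 = balanced (default), 0 = traditional metadata-only adjuster
# echo 0 > /sys/module/zfs/parameters/zfs_arc_meta_strategy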

@jonathanvaughn
Author

This seems to be an improvement.

So far I was able to complete an rsync of one dataset (the remote machine isn't ZFS, so this isn't ZFS send/recv) without it locking up. arc_reclaim is still using about 60% CPU, but I don't see a bunch of arc_prune on top of it, and the system is still responsive with a load of only ~1. ARC size was just a bit over the limit (8.04 GB vs 8 GB). I've started another rsync of another dataset and the ARC is still growing beyond the limit (currently 9.21 GB vs the 8 GB limit), but the system is responsive thus far and otherwise "working as expected".

@jonathanvaughn
Author

jonathanvaughn commented Jun 5, 2016

After those two rsyncs (the total files + directories in these datasets are over 7.3 million), ARC size is 14.24 GB vs the 8 GB limit set. After setting the datasets to primarycache=metadata, rebooting, and doing all of this again, the result is the same: over 14 GB of ARC usage even though the limit is set to 8 GB.

So presumably there is a problem preventing metadata from being evicted from the ARC to stay within the size limit, while data blocks do get evicted to make room?

@dweeezil
Contributor

dweeezil commented Jun 6, 2016

@jonathanvaughn You might be better off with the default "balanced" strategy. There are a couple of parameters it uses: zfs_arc_meta_adjust_restarts (default 4096) controls the number of passes made over data & metadata and likely accounts for all the CPU being used by the arc_reclaim thread. The zfs_arc_meta_prune parameter (default 10000) is the number of objects it scans, and it's doubled on every other pass. Try lowering zfs_arc_meta_adjust_restarts to maybe 4 or 8 and increasing zfs_arc_meta_prune to something much larger, maybe 100000, 500000 or even 1000000, and see if it doesn't work better.
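(A hedged sketch of applying those suggestions at runtime via the usual module-parameter interface; to persist across reboots they would instead go in something like /etc/modprobe.d/zfs.conf as "options zfs ..." lines:)

# echo 8 > /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts    # default 4096; far fewer passes per reclaim cycle
# echo 500000 > /sys/module/zfs/parameters/zfs_arc_meta_prune         # default 10000; scan many more objects per pass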

@jonathanvaughn
Author

Trying right now with restarts at 8 and prune at 100000. So far this seems to be keeping arc_reclaim from going crazy.

Is there a technical reason why there isn't some kind of built-in abort for the reclaim loop if it's taking more than some sane amount of time to finish reclaiming?

Either way, if this solves it, great. Eating ~50% of RAM for this server would be not great but survivable (just); however, the next server I had on my plate to upgrade the OS on and move to ZFS has ~27 million files vs ~7 million, and assuming a linear increase in the RAM required for metadata, it wouldn't even fit in memory. D:

@jonathanvaughn
Author

Looks like I spoke too soon / didn't wait long enough. I am going to continue to fiddle with those two settings though and see if some combination works out.

@dweeezil
Contributor

dweeezil commented Jun 6, 2016

@jonathanvaughn There hasn't been a report of runaway metadata in a while, so I'm wondering if there's anything else special about your setup. I ran a few of my normal tests and wasn't able to duplicate the situation. I'm going to start by switching to a 4.6 kernel, since my main test system is running 4.4.6 at the moment, to make sure it's not something with the kernel. Could you, for example, be using xattrs extensively? Are you using --fake-super?

In balanced mode, it tries to unpin metadata from the arc_prune thread by calling into the Linux superblock shrinker (super_cache_scan()), requesting scans of multiples of zfs_arc_meta_prune objects. You can use egrep '^(inode_cache|dentry)' /proc/slabinfo to get an idea of how successful it is.

@dweeezil
Contributor

dweeezil commented Jun 6, 2016

@jonathanvaughn In my initial tests on a 4.6 kernel, it looks like I was able to duplicate the problem. I'm going back to 4.4.6 to make sure it really didn't happen there.

@jonathanvaughn
Author

So the reason I didn't initially have total success was that I forgot to change the strategy back, and after I did so I was unable to reboot to clear the ARC (because our users were coming online). However, over the last ~12 hours those ZFS parameter changes did work, and the ARC size is now staying at the ARC maximum. I guess that because there were already some arc_reclaim runs using the old strategy, it took a long time to finally switch over, but once it did, things have been working as expected.

We didn't have any issues on the previous servers, which actually had more data, but they were databases so the file count was far lower (a few large files vs many small ones).

I guess we can close this?

@jonathanvaughn
Author

jonathanvaughn commented Jun 6, 2016

For whatever reason GitHub hadn't refreshed, so I didn't see your last updates.

I don't think we're using xattrs extensively, but we are using the xattr=sa setting. I am not aware of any specific uses of xattrs, but there might be some projects that have used them (so there could be some tens or hundreds of thousands of files with xattrs, out of the millions).

I wasn't using --fake-super (I was running rsync as root on both ends).

# egrep '^(inode_cache|dentry)' /proc/slabinfo
inode_cache        19316  19516    568   28    4 : tunables    0    0    0 : slabdata    697    697      0
dentry            4122640 4126290    192   21    1 : tunables    0    0    0 : slabdata 196490 196490      0

Not sure how relevant the slabinfo is currently since the problem is "fixed".

The current settings are zfs_arc_meta_adjust_restarts = 4 and zfs_arc_meta_prune = 500000.

I will try relaxing those to 8 and 100000 later, which is where I started, but since I already had ARC data exceeding the ARC size limit, it took time for arc_reclaim to pick up the new settings, and I kept changing things ...

@dweeezil
Contributor

dweeezil commented Jun 6, 2016

@jonathanvaughn This is a regression caused by changes between the 4.5 and 4.6 kernel. I'm working on a patch.

@dweeezil
Contributor

dweeezil commented Jun 6, 2016

The problem appears to be the continuing evolution of memory cgroups (memcg). If you boot with cgroup_disable=memory, the reclaiming should start working again. I haven't worked up a patch yet.
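(On CentOS 7 with GRUB2, a hedged sketch of adding that boot parameter to every installed kernel; adjust as needed for your bootloader setup:)

# grubby --update-kernel=ALL --args="cgroup_disable=memory"   # append the option to each kernel's command line
# reboot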

@jonathanvaughn
Author

I will try to test this in a few hours when we have no active users of that system.

@jonathanvaughn
Author

jonathanvaughn commented Jun 7, 2016

I am starting the test now on the existing server with the cgroup_disable=memory setting.

Also, I'm building yet another new server, and this one so far isn't having any problems even without making those ZFS parameter changes. The only differences are: hard drives (3 TB vs 2 TB, and a different brand), which seems unlikely to be related; CPU (the previous one was an AMD FX-8320, the one I'm building now is an FX-8350); and kernel version (the new server is on kernel 4.6.1 vs 4.6.0 on the previous one, since that is now the latest kernel-ml package from ELRepo).

It may be that whatever was "broken" in 4.6.0 has been "fixed" in 4.6.1. I'm letting things run for a while on both servers to try to verify that (A) cgroup_disable=memory fixes kernel 4.6.0 and (B) kernel 4.6.1 works without any special changes (either kernel boot args or ZFS parameters).

@jonathanvaughn
Author

cgroup_disable=memory definitely solves the problem under 4.6.0. I am testing that same machine on 4.6.1 now, since 4.6.1 is working on the other machine without any special configuration.

@jonathanvaughn
Author

Upgrading to 4.6.1 solved it for the machine I originally posted this issue about.

So it looks like whatever "they" broke in kernel 4.6.0, "they" fixed in kernel 4.6.1. As far as I'm personally concerned, then, there's no need to make a patch for 4.6.0 to solve this.

@dweeezil
Contributor

dweeezil commented Jun 8, 2016

The problem is caused by the continued development and integration of memory cgroups. The key commit which caused the ZoL shrinker to stop working is torvalds/linux@b313aee.

The obvious fix is to make the ZoL shrinker memcg-aware, similarly to the way in which it was made NUMA-aware. Unfortunately, this doesn't seem possible as far as I can tell, because mem_cgroup_iter() isn't exported, so some other hackish solution may be required.

I'll note, too, that this is certainly not "fixed" in 4.6.1 mainly because nothing is actually broken.

@jonathanvaughn
Author

jonathanvaughn commented Jun 8, 2016

Well, I've repeatedly churned through 3x as much data as it previously took to make the ARC grow uncontrollably, without any further issues, so the problem is at least no longer presenting itself (yet) under the same circumstances as before.

@dweeezil
Contributor

dweeezil commented Jun 8, 2016

@jonathanvaughn I can't explain why 4.6.1 would work OK. I took a pretty good look at all 101 commits between 4.6 and 4.6.1 and there doesn't appear to be anything that would impact memory cgroups at all. Does your 4.6.1 system still mount the memory cgroup at /sys/fs/cgroup/memory? Does its arc_meta_used always stay well below arc_meta_limit and not overshoot by much?
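(A hedged way to keep an eye on that from the shell, reading the same kstat file quoted earlier in this thread:)

# awk '/^arc_meta_(used|limit)/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats    # re-run periodically, or wrap in watch(1)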

@jonathanvaughn
Author

It overshoots occasionally but not abnormally - i.e., by a few percent - and then goes back to the limit.

/sys/fs/cgroup/memory/

total 0
-rw-r--r--   1 root root 0 Jun  7 05:20 cgroup.clone_children
--w--w--w-   1 root root 0 Jun  7 05:20 cgroup.event_control
-rw-r--r--   1 root root 0 Jun  7 05:20 cgroup.procs
-r--r--r--   1 root root 0 Jun  7 05:20 cgroup.sane_behavior
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.failcnt
--w-------   1 root root 0 Jun  7 05:20 memory.force_empty
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.failcnt
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.limit_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.max_usage_in_bytes
-r--r--r--   1 root root 0 Jun  7 05:20 memory.kmem.slabinfo
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.tcp.failcnt
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.tcp.limit_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.kmem.tcp.max_usage_in_bytes
-r--r--r--   1 root root 0 Jun  7 05:20 memory.kmem.tcp.usage_in_bytes
-r--r--r--   1 root root 0 Jun  7 05:20 memory.kmem.usage_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.limit_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.max_usage_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.memsw.failcnt
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.memsw.limit_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.memsw.max_usage_in_bytes
-r--r--r--   1 root root 0 Jun  7 05:20 memory.memsw.usage_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.move_charge_at_immigrate
-r--r--r--   1 root root 0 Jun  7 05:20 memory.numa_stat
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.oom_control
----------   1 root root 0 Jun  7 05:20 memory.pressure_level
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.soft_limit_in_bytes
-r--r--r--   1 root root 0 Jun  7 05:20 memory.stat
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.swappiness
-r--r--r--   1 root root 0 Jun  7 05:20 memory.usage_in_bytes
-rw-r--r--   1 root root 0 Jun  7 05:20 memory.use_hierarchy
-rw-r--r--   1 root root 0 Jun  7 05:20 notify_on_release
-rw-r--r--   1 root root 0 Jun  7 05:20 release_agent
drwxr-xr-x 137 root root 0 Jun  7 05:20 system.slice
-rw-r--r--   1 root root 0 Jun  7 05:20 tasks
drwxr-xr-x   2 root root 0 Jun  7 05:20 user.slice

@jonathanvaughn
Author

For what it's worth, I've built yet another machine today, and the kernel-ml package is now 4.6.2. It also appears to work fine so far (holding at no more than a fraction of a percent over the ARC limit).

@jonathanvaughn
Author

Well, one of the servers (currently on 4.6.1) just had this issue again, though I don't know why. It wasn't seeing any higher level of IO than normal, nowhere near what happened originally to cause this issue. I did notice in arcstat.py that 'c' was below the 8 GB I set (7.6 GB) and that arcsz was around 8 GB. There was almost 8 GB of free RAM, so there shouldn't have been any memory pressure forcing 'c' below the 8 GB I set it to. I was (eventually) able to log in and set the strategy to 0, which immediately caused load to drop, but arcsz is going up past 8 GB. I set restarts to 4 and prune to 500000 and set the strategy back to 1 ... hopefully arcsz stops growing and drops back to the 8 GB max I set.

@dweeezil
Contributor

This problem is difficult to track down because it depends on the place in the memcg hierarchy at which the process performing the allocations resides. I'm running my test program as a normal logged-in user, and on my Ubuntu 14.04 system my shell appears in the hierarchy at .../memory/user/<uid>.user/<sessno>.session. When my shell, and therefore the testing programs, are in this deeper part of the hierarchy, the reclaim doesn't happen. If I move the shell to the root of the hierarchy with echo <pid> >> /sys/fs/cgroup/memory/tasks and run the test, the reclaim works perfectly fine.

ZoL is calling the generic superblock shrinker with a NULL memcg, and this is what's stopped working in the newer kernels due to other memcg-related changes. Unfortunately, the support necessary to make ZoL's shrinker wrapper memcg-aware is either not exported from the kernel or is exported GPL-only.
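(A hedged sketch of seeing where a process sits in that hierarchy and moving a shell to the root of it, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory:)

# grep memory /proc/$$/cgroup             # show this shell's position in the memory cgroup hierarchy
# echo $$ > /sys/fs/cgroup/memory/tasks   # move this shell to the root of the memory hierarchy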

@jonathanvaughn
Author

Since there's no telling when these issues will be resolved (either by finding a workaround on ZFS's side or by providing access to the necessary internals on the kernel side), I guess I'll need to downgrade kernels on these machines.

4.4.13 should be safe? It's the latest in the elrepo-kernel repo as kernel-lt; kernel-ml doesn't appear to have any older versions.
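(For reference, a hedged sketch of installing the long-term-support kernel from ELRepo on CentOS 7; the exact version pulled in depends on the repo at the time, and the 4.4.x entry still has to be selected in GRUB, e.g. with grub2-set-default, before rebooting:)

# yum --enablerepo=elrepo-kernel install kernel-lt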

@dweeezil
Contributor

There really doesn't seem to be a good way of making the shrinker memcg-aware given the interfaces exported from, at least, the 4.6 series of kernels so I think the best solution for now is to fall back to the older d_prune_aliases() scheme if necessary. The patch in f7d22c4 seems to work just fine.

dweeezil added a commit to dweeezil/zfs that referenced this issue Jun 17, 2016
As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliases() method
in case the kernel's per-superblock shrinker is not able to free anything.

Fixes: openzfs#4726
@behlendorf behlendorf added this to the 0.7.0 milestone Jun 17, 2016
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Jun 21, 2016
As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliases() method
in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes: openzfs#4726
sempervictus pushed a commit to sempervictus/zfs that referenced this issue Jun 26, 2016
dweeezil added a commit to dweeezil/zfs that referenced this issue Jul 13, 2016
GeLiXin added a commit to GeLiXin/zfs that referenced this issue Aug 1, 2016
* Consistently use parsable instead of parseable

This is a purely cosmetic change, to consistently prefer one of
two (both acceptable) choices for the word parsable in documentation and
code. I don't really care which to use, but according to wiktionary
https://en.wiktionary.org/wiki/parsable#English parsable is preferred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4682

* Add missing RPM BuildRequires

Both libudev and libattr are recommended build requirements.  As
such their development headers should be listed in the rpm spec file
so those dependencies are pulled in when building rpm packages.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4676

* Skip ctldir znode in zfs_rezget to fix snapdir issues

Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This
will cause funny behaviour for the mounted snapdirs. Especially for
Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone
from automounting it again as long as someone is still using the detached mount.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4514
Closes #4661
Closes #4672

* Improve zfs-module-parameters(5)

Various rewrites to the descriptions of module parameters. Corrects
spelling mistakes, makes descriptions more user-friendly and
describes some ZFS quirks which should be understood before changing
parameter values.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4671

* Fix arc_prune_task use-after-free

arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent
the underlying zsb from disappearing if there's a concurrent umount. We fix
this by forcing the caller of arc_remove_prune_callback to wait for
arc_prune_taskq to finish.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4687
Closes #4690

* Add request size histograms (-r) to zpool iostat, minor man page fix

Add -r option to "zpool iostat" to print request size histograms for the leaf
ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs
("agg"). These stats can be useful for seeing how well the ZFS IO aggregator
is working.

$ zpool iostat -r
mypool        sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0    530      0      0      0
1K              0      0    260      0      0      0    116    246      0      0
2K              0      0      0      0      0      0      0    431      0      0
4K              0      0      0      0      0      0      3    107      0      0
8K             15      0     35      0      0      0      0      6      0      0
16K             0      0      0      0      0      0      0     39      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K            20      0     40      0      0      0      0      0      0      0
128K            0      0     20      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0    155     19      0      0
8M              0      0      0      0      0      0      0    811      0      0
16M             0      0      0      0      0      0      0     68      0      0
--------------------------------------------------------------------------------

Also rename the stray "-G" in the man page to be "-w" for latency histograms.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #4659

* OpenZFS 6531 - Provide mechanism to artificially limit disk performance

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
  to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files

* Systemd configuration fixes

* Disable zfs-import-scan.service by default.  This ensures that
pools will not be automatically imported unless they appear in
the cache file.  When this service is explicitly enabled pools
will be imported with the "cachefile=none" property set.  This
prevents the creation of, or update to, an existing cache file.

    $ systemctl list-unit-files | grep zfs
    zfs-import-cache.service                  enabled
    zfs-import-scan.service                   disabled
    zfs-mount.service                         enabled
    zfs-share.service                         enabled
    zfs-zed.service                           enabled
    zfs.target                                enabled

* Change services to dynamic from static by adding an [Install]
section and adding 'WantedBy' tags in favor of 'Requires' tags.
This allows for easier customization of the boot behavior.

* Start the zfs-import-cache.service after the root pivot so
the cache file is available in the standard location.

* Start the zfs-mount.service after the systemd-remount-fs.service
to ensure the root fs is writeable and the ZFS filesystems can
create their mount points.

* Change the default behavior to only load the ZFS kernel modules
in zfs-import-*.service or when blkid(8) detects a pool.  Users
who wish to unconditionally load the kernel modules must uncomment
the list of modules in /lib/modules-load.d/zfs.conf.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4325
Closes #4496
Closes #4658
Closes #4699

* Fix self-healing IO prior to dsl_pool_init() completion

Async writes triggered by a self-healing IO may be issued before the
pool finishes the process of initialization.  This results in a NULL
dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes().

George Wilson recommended addressing this issue by initializing the
passed `dsl_pool_t **` prior to dmu_objset_open_impl().  Since the
caller is passing the `spa->spa_dsl_pool` this has the effect of
ensuring it's initialized.

However, since this depends on the caller knowing they must pass
the `spa->spa_dsl_pool` an additional NULL check was added to
vdev_queue_max_async_writes().  This guards against any future
restructuring of the code which might result in dsl_pool_init()
being called differently.

Signed-off-by: GeLiXin <47034221@qq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4652

* Add isa_defs for MIPS

GCC for MIPS only defines _LP64 when 64-bit,
while _ILP32 is not defined when 32-bit.

Signed-off-by: YunQiang Su <syq@debian.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4712

* Fix out-of-bound access in zfs_fillpage

The original code will do an out-of-bounds access on pl[] during the last
iteration.

 ==================================================================
 BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs]
 Read of size 8 by task tmpfile/7850
 page:ffffea00017c6dc0 count:0 mapcount:0 mapping:          (null) index:0x0
 flags: 0xffff8000000000()
 page dumped because: kasan: bad access detected
 CPU: 3 PID: 7850 Comm: tmpfile Tainted: G           OE   4.6.0+ #3
  ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618
  ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8
  ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3
 Call Trace:
  [<ffffffff81635618>] dump_stack+0x63/0x8b
  [<ffffffff81313ee8>] kasan_report_error+0x528/0x560
  [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0
  [<ffffffff813144b8>] kasan_report+0x58/0x60
  [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffff81312e4e>] __asan_load8+0x5e/0x70
  [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs]

  [<ffffffff81353c3a>] SyS_execve+0x3a/0x50
  [<ffffffff810058ef>] do_syscall_64+0xef/0x180
  [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25
 Memory state around the buggy address:
  ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4
                                                                 ^
  ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ==================================================================

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4705
Issue #4708

* Fix memleak in zpl_parse_options

strsep() will advance tmp_mntopts, and will change it to NULL on the
last iteration.  This will cause strfree(tmp_mntopts) to not free
anything.

unreferenced object 0xffff8800883976c0 (size 64):
  comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s)
  hex dump (first 32 bytes):
    72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a  rw.strictatime.z
    66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d  fsutil.mntpoint=
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811f9cac>] __kmalloc+0x16c/0x250
    [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl]
    [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs]
    [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs]
    [<ffffffff81222dc8>] mount_fs+0x38/0x160
    [<ffffffff81240097>] vfs_kern_mount+0x67/0x110
    [<ffffffff812428e0>] do_mount+0x250/0xe20
    [<ffffffff812437d5>] SyS_mount+0x95/0xe0
    [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4706
Issue #4708

* Fix memleak in vdev_config_generate_stats

fnvlist_add_nvlist will copy the contents of nvx, so we need to
free it here.

unreferenced object 0xffff8800a6934e80 (size 64):
  comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s)
  hex dump (first 32 bytes):
    60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff  `..s.....|.s....
    00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff  ........@.p.....
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310
    [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl]
    [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl]
    [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
    [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair]
    [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair]
    [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair]
    [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs]
    [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs]
    [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs]
    [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs]
    [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs]
    [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs]
    [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0
    [<ffffffff812333b9>] SyS_ioctl+0x79/0x90

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4707
Issue #4708

* Linux 4.7 compat: handler->set() takes both dentry and inode

Counterpart to fd4c7b7, the same approach was taken to resolve
the compatibility issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4717 
Issue #4665

* Implementation of AVX2 optimized Fletcher-4

New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.

New zcommon module parameters:
-  zfs_fletcher_4_impl (str): selects the implementation to use.
    "fastest" - use the fastest version available
    "cycle"   - cycle trough all available impl for ztest
    "scalar"  - use the original version
    "avx2"    - new AVX2 implementation if available

Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar:  4216 MB/s
- AVX2:   14499 MB/s

See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
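
As a hedged illustration (assuming the zcommon module is loaded and the
host supports AVX2), the selection can be inspected and changed through
the parameter file noted above:

    $ cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl
    # echo avx2 > /sys/module/zcommon/parameters/zfs_fletcher_4_impl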

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4330

* Fix cstyle.pl warnings

As of perl v5.22.1 the following warnings are generated:

* Redundant argument in printf at scripts/cstyle.pl line 194

* Unescaped left brace in regex is deprecated, passed through
  in regex; marked by <-- HERE in m/\S{ <-- HERE / at
  scripts/cstyle.pl line 608.

They have been addressed by escaping the left braces and by
providing the correct number of arguments to printf based on
the fmt specifier set by the verbose option.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4723

* Fix minor spelling mistakes

Trivial spelling mistake fix in error message text.

* Fix spelling mistake "adminstrator" -> "administrator"
* Fix spelling mistake "specificed" -> "specified"
* Fix spelling mistake "interperted" -> "interpreted"

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4728

* Add `zfs allow` and `zfs unallow` support

ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands.  In addition, non-
privileged users should be able to run all of the following commands:

  * zpool [list | iostat | status | get]
  * zfs [list | get]

Historically this functionality was not available on Linux.  In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability.  Only then could the permissions on
the `/dev/zfs` device be relaxed and the internal ZFS permission checks
used.

Even with this change some limitations remain.  Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace).  This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code.  It
may be possible to add this functionality in the future.
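
A hypothetical example of delegating and revoking a few of the supported
permissions (the user and dataset names are illustrative):

    # zfs allow alice create,snapshot,destroy tank/home
    # zfs allow tank/home
    # zfs unallow alice create tank/home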

This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite.  These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.

Two minor bug fixes were required for test-running.py.  First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it.  Second,
when running as a normal user, also check for scripts using both
the .ksh and .sh suffixes.

Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #362 
Closes #434 
Closes #4100
Closes #4394 
Closes #4410 
Closes #4487

* Remove libzfs_graph.c

The libzfs_graph.c source file should have been removed in 330d06f,
it is entirely unused.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4766

* Linux 4.6 compat: Fall back to d_prune_aliases() if necessary

As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliases() method
in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes: #4726

* SIMD implementation of vdev_raidz generate and reconstruct routines

This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.

Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
  module load, the parameter will only accept the first 3 options; the
  other implementations can be set once the module has finished
  loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
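
As a hedged sketch (assuming the implementation is supported on the
host), the routine can be inspected and selected at runtime through the
parameter file noted above:

    $ cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
    # echo avx2 > /sys/module/zfs/parameters/zfs_vdev_raidz_impl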

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4328

* Fix NFS credential

The commit f74b821 caused a regression where creating a file through NFS
will always create a file owned by root. This is because the patch enables
the KSID code in zfs_acl_ids_create, which uses the euid and egid of the
current process. However, on Linux, we should use fsuid and fsgid for file
operations, which is the original behaviour. So we revert this part of the
code.

The patch also enables secpolicy_vnode_*; since they are also used in file
operations, we change them to use fsuid and fsgid.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4772
Closes #4758

* OpenZFS 6513 - partially filled holes lose birth time

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6513
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0

If a ZFS object contains a hole at level one, and then a data block is
created at level 0 underneath that l1 block, l0 holes will be created.
However, these l0 holes do not have the birth time property set; as a
result, incremental sends will not send those holes.

Fix is to modify the dbuf_read code to fill in birth time data.

* Add a test case for dmu_free_long_range() to ztest

Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4754

* Revert "Add a test case for dmu_free_long_range() to ztest"

This reverts commit d0de2e82df579f4e4edf5643b674a1464fae485f which
introduced a new test case to ztest which is failing occasionally
during automated testing.  The change is being reverted until
the issue can be fully investigated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4754

* OpenZFS 6878 - Add scrub completion info to "zpool history"

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Authored by: Nav Ravindranath <nav@delphix.com>
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6878
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5
Closes #4787

* FreeBSD rS271776 - Persist vdev_resilver_txg changes

Persist vdev_resilver_txg changes to avoid panic caused by validation
vs a vdev_resilver_txg value from a previous resilver.

Authored-by: smh <smh@FreeBSD.org>
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5154
FreeBSD-issue: https://reviews.freebsd.org/rS271776
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf
Closes #4790

* xattrtest: allow verify with -R and other improvements

- Use a fixed buffer of random bytes when random xattr values are in
  effect.  This eliminates the potential performance bottleneck of
  reading from /dev/urandom for each file. This also allows us to
  verify xattrs in random value mode.

- Show the rate of operations per second in addition to elapsed time
  for each phase of the test. This may be useful for benchmarking.

- Set default xattr size to 6 so that verify doesn't fail if user
  doesn't specify a size. We need at least six bytes to store the
  leading "size=X" string that is used for verification.

- Allow user to execute just one phase of the test. Acceptable
  values for -o and their meanings are:

   1 - run the create phase
   2 - run the setxattr phase
   3 - run the getxattr phase
   4 - run the unlink phase

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Backfill metadnode more intelligently

Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates.  As summarized by
@mahrens in #4636:

"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects).  When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.

The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."

Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limiting the rescan to at most once per TXG.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Implement large_dnode pool feature

Justification
-------------

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks.  Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided.  Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.  Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
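
For example, a hedged sketch of setting a literal dnode size and
verifying it, reusing the dataset from the example above:

    # zfs set dnodesize=2k tank/fish
    # zfs get dnodesize tank/fish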

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The slots parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by a large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number.  This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.

ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
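
As a hedged illustration (pool and snapshot names are hypothetical), a
receive into a pool lacking the large_dnode feature is expected to fail
gracefully with an unsupported-feature error rather than producing a
corrupt dataset:

    # zfs send tank/fish@snap | zfs receive oldpool/fish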

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.

Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542

* Sync DMU_BACKUP_FEATURE_* flags

Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING.  The
DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and
then reserved in the upstream OpenZFS implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #4795

* OpenZFS 2605, 6980, 6902

2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12

6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f

Porting notes:
- All rsend and snapshot tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
  'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
  this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
  rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.

* OpenZFS 6051 - lzc_receive: allow the caller to read the begin record

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6051
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322

* OpenZFS 6393 - zfs receive a full send as a clone

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6394
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e

* OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS

Authored by: Andrew Stormont <astormont@racktopsystems.com>
Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com>
Reviewed by: Kim Shrier <kshrier@racktopsystems.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6536
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b

* OpenZFS 6738 - zfs send stream padding needs documentation

Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6738
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff

* OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota

Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/4986
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad

* OpenZFS 6562 - Refquota on receive doesn't account for overage

Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6562
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6

* Implement zfs_ioc_recv_new() for OpenZFS 2605

Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy
ZFS_IOC_RECV user/kernel interface.  The new interface supports all
stream options but is currently only used for resumable streams.
This way updated user space utilities will interoperate with older
kernel modules.

ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW
handler.  Non-Linux OpenZFS platforms have opted to change the
legacy interface in an incompatible fashion instead of adding a
new ioctl.
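
From user space, a resumable receive (the functionality these ioctls
support) looks roughly like the following hedged sketch, with
hypothetical pool and dataset names; `zfs receive -s` saves a
receive_resume_token when interrupted and `zfs send -t` resumes from it:

    # zfs send tank/fish@snap | zfs receive -s backup/fish
    # zfs get -H -o value receive_resume_token backup/fish
    # zfs send -t <token> | zfs receive -s backup/fish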

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* OpenZFS 6314 - buffer overflow in dsl_dataset_name

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6314
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee

* OpenZFS 6876 - Stack corruption after importing a pool with a too-long name

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking
for trouble. We should check every dataset on import, using a 1024 byte
buffer and checking each time to see if the dataset's new name is longer
than 256 bytes.

OpenZFS-issue: https://www.illumos.org/issues/6876
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e

* Vectorized fletcher_4 must be 128-bit aligned

The fletcher_4_native() and fletcher_4_byteswap() functions may only
safely use the vectorized implementations when the buffer is 128-bit
aligned.  This is because both the AVX2 and SSE implementations process
four 32-bit words per iteration.  Fall back to the scalar implementation
which only processes a single 32-bit word for unaligned buffers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4330

* Allow building with `CFLAGS="-O0"`

If compiled with -O0, gcc doesn't do any stack frame coalescing
and -Wframe-larger-than=1024 is triggered in debug mode.
Starting with gcc 4.8, the new optimization level -Og is available for
debugging, and it does not trigger this warning.
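
For instance, a hedged sketch of a debug build with optimization
disabled (configure options other than CFLAGS are illustrative):

    $ ./configure --enable-debug CFLAGS="-O0"
    $ make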

Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4799

* Don't allow accessing XATTR via export handle

Allowing access to XATTR through an export handle is a very bad idea.
It would allow a user to write whatever they want in fields where they
otherwise could not.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828

* Fix get_zfs_sb race with concurrent umount

Certain ioctl operations will call get_zfs_sb, which holds an active
count on the sb without checking whether it's active or not. This can
result in a use-after-free. We fix this by using atomic_inc_not_zero
to make sure we get an active sb.

P1                                          P2
---                                         ---
deactivate_locked_super(): s_active = 0
                                            zfs_sb_hold()
                                            ->get_zfs_sb(): s_active = 1
->zpl_kill_sb()
-->zpl_put_super()
--->zfs_umount()
---->zfs_sb_free(zsb)
                                            zfs_sb_rele(zsb)

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Fix Large kmem_alloc in vdev_metaslab_init

This allocation can go way over 1MB, so we should use vmem_alloc
instead of kmem_alloc.

  Large kmem_alloc(1430784, 0x1000), please file an issue...
  Call Trace:
   [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl]
   [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs]
   [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs]
   [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs]
   [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs]
   [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs]
   [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs]
   [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs]
   [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0
   [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4752

* Add configure result for xattr_handler

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828

* fh_to_dentry should return ESTALE when generation mismatch

When the generation doesn't match, it usually means the file pointed to by
the file handle was deleted. We should return ESTALE to indicate this. We
return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828

* xattr dir doesn't get purged during iput

We need to set inode->i_nlink to zero so iput will purge it. Without this,
it will only get purged during a cache shrink or umount, which would likely
result in deadlock due to zfs_zget waiting forever on its children, which
are in the dispose_list of the same thread.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Kill zp->z_xattr_parent to prevent pinning

zp->z_xattr_parent will pin the parent. This causes a huge issue when
unlinking a file with xattrs. Because the unlinked file is pinned, it
will never get purged immediately, and because of that, its xattrs will
never be marked as unlinked. So the whole unlinked state will stay there
until a cache shrink or umount.

This change partially reverts e89260a.  This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Fix RAIDZ_TEST tests

Remove stray trailing } which prevented the raidz stress tests from
running in-tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z

The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.

Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
  db_blkptr = NULL and it's dirtied.

Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
  xattr fits into the bonus buffer, so it's removed. The dbuf is
  undirtied in this txg, but it's still referenced and cannot be
  destroyed.

Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
  is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
  (not done yet).
* A new change makes the spill buffer necessary again.
  sa_build_layouts() ends up calling dbuf_find() to locate the
  dbuf.  It finds the old dbuf because it has not been destroyed yet
  (it will be destroyed when the previous write is done and there
  are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
  referenced, so it's not destroyed.

Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
  directly copied into the dnode, overwriting the blkptr area because,
  in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
  gets corrupted.

Signed-off-by: Peng <peng.hse@xtaotech.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3937

* Fix handling of errors nvlist in zfs_ioc_recv_new()

zfs_ioc_recv_impl() is changed to always allocate the 'errors'
nvlist; its callers are responsible for freeing it.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4829

* Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.

The patch covers low-end and older x86 CPUs.  Parity generation is
equivalent to the SSSE3 implementation, but reconstruction is somewhat
slower.  The previous 'sse' implementation is renamed to 'ssse3' to
indicate the highest instruction set used.

Benchmark results:
scalar_rec_p                    4    720476442
scalar_rec_q                    4    187462804
scalar_rec_r                    4    138996096
scalar_rec_pq                   4    140834951
scalar_rec_pr                   4    129332035
scalar_rec_qr                   4    81619194
scalar_rec_pqr                  4    53376668

sse2_rec_p                      4    2427757064
sse2_rec_q                      4    747120861
sse2_rec_r                      4    499871637
sse2_rec_pq                     4    522403710
sse2_rec_pr                     4    464632780
sse2_rec_qr                     4    319124434
sse2_rec_pqr                    4    205794190

ssse3_rec_p                     4    2519939444
ssse3_rec_q                     4    1003019289
ssse3_rec_r                     4    616428767
ssse3_rec_pq                    4    706326396
ssse3_rec_pr                    4    570493618
ssse3_rec_qr                    4    400185250
ssse3_rec_pqr                   4    377541245

original_rec_p                  4    691658568
original_rec_q                  4    195510948
original_rec_r                  4    26075538
original_rec_pq                 4    103087368
original_rec_pr                 4    15767058
original_rec_qr                 4    15513175
original_rec_pqr                4    10746357

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4783

* Enable zpool_upgrade test cases

Creating the pool in a striped rather than mirrored configuration
provides enough space for all upgrade tests to run.  Test case
zpool_upgrade_007_pos still fails and must be investigated so
it has been left disabled.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4852

* Prevent null dereferences when accessing dbuf kstat

In arc_buf_info(), the arc_buf_t may have no header.  If not, don't try
to fetch the arc buffer stats and instead just zero them.

The null dereferences were observed while accessing the dbuf kstat with
awk on a system in which millions of small files were being created in
order to overflow the system's metadata limit.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4837

* Fix dbuf_stats_hash_table_data race

Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf
can be freed at any time.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4846

* Use native inode->i_nlink instead of znode->z_links

A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's
64 bit on-disk link count.

We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a
more Linux-integrated fix for the same issue.

In addition, setting the initial link count on a new node has been changed
from setting one less than required in zfs_mknode() then incrementing to the
correct count in zfs_link_create() (which was somewhat bizarre in the first
place), to setting the correct count in zfs_mknode() and not incrementing it
in zfs_link_create(). This both means we no longer set the link count in
sa_bulk_update() twice (once for the initial incorrect count then again for
the correct count), as well as adhering to the Linux requirement of not
incrementing a zero link count without I_LINKABLE (see linux commit
f4e0c30c).

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4838
Issue #227

* Implementation of SSE optimized Fletcher-4

Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.

The module benchmark was also amended to analyze the performance of
the byteswapped version of Fletcher-4, as well as the non-byteswapped
version. The average performance of the two is used to select the
fastest implementation available on the host system.

Adds a pair of fields to an existing zcommon module parameter:
-  zfs_fletcher_4_impl (str)
    "sse2"    - new SSE2 implementation if available
    "ssse3"   - new SSSE3 implementation if available

Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4789

* Fix filesystem destroy with receive_resume_token

It is possible that the given DS may have hidden child (%recv)
datasets - "leftovers" resulting from a previously interrupted
'zfs receive'.  Try to remove the hidden child (%recv) and after
that try to remove the target dataset.  If the hidden child
(%recv) does not exist the original error (EEXIST) will be returned.

Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4818

* Prevent segfaults in SSE optimized Fletcher-4

In some cases, the compiler was not respecting the GNU aligned
attribute for stack variables in 35a76a0. This was resulting in
a segfault on CentOS 6.7 hosts using gcc 4.4.7-17.  This issue
was fixed in gcc 4.6.

To prevent this from occurring, use unaligned loads and stores
for all stack and global memory references in the SSE optimized
Fletcher-4 code.

Disable zimport testing against master where this flaw exists:

TEST_ZIMPORT_VERSIONS="installed"

Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4862

* Update arc_summary.py for prefetch changes

Commit 7f60329 removed several kstats which arc_summary.py read.
Remove these kstats from arc_summary.py in the same way this was
handled in FreeNAS.

FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4695

* Wait iput_async before evict_inodes to prevent race

Wait for iput_async before entering evict_inodes in
generic_shutdown_super. The reason we must finish before
evict_inodes is that when lazytime is on, or when zfs_purgedir calls
zfs_zget, iput would bump i_count from 0 to 1. This would race
with the i_count check in evict_inodes, which means it could
destroy the inode while we are still using it.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4854

* Fixes and enhancements of SIMD raidz parity

- Implementation lock replaced with atomic variable

- Trailing whitespace is removed from the user-specified parameter, to
improve the experience when using commands that add a newline, e.g. `echo`

- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813

- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392

- Minor fixes and cleanups

- Enable use of the original parity methods in the [fastest] configuration.
A new opaque original ops structure, representing the native methods, is
added to the supported raidz methods. The original parity methods are
executed if the selected implementation has a NULL fn pointer.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392

* RAIDZ parity kstat rework

Print table with speed of methods for each implementation.
Last line describes contents of [fastest] selection.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4860

* Fix NULL pointer in zfs_preumount from 1d9b3bd

When zfs_domount fails zsb will be freed, and its caller
mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into
zfs_preumount.

In order to make sure we don't touch anything that no longer exists, we
must make sure s_fs_info is NULL in the failure path so zfs_preumount can
easily check for that.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4867
Issue #4854

* Illumos Crypto Port module added to enable native encryption in zfs

A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329

* Fix for compilation error when using the kernel's CONFIG_LOCKDEP

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329

* zloop: print backtrace from core files

Find the core file by using `/proc/sys/kernel/core_pattern`

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4874

* Fix for metaslab_fastwrite_unmark() assert failure

Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4267

* Remove znode's z_uid/z_gid member

Remove duplicate z_uid/z_gid member which are also held in the
generic vfs inode struct. This is done by first removing the members
from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID
macros to access the respective member from struct inode. In cases
where the uid/gids are being marshalled from/to disk, use the newly
introduced zfs_(uid|gid)_(read|write) functions to properly
save the uids rather than the internal kernel representation.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227

* Check whether the kernel supports i_uid/gid_read/write helpers

Since the concept of a kuid, and the need to translate from it to an
ordinary integer type, was added in kernel version 3.5, implement the
necessary plumbing to detect this condition at compile time. If the
kernel doesn't support kuids, just fall back to directly accessing the
respective struct inode members.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227

* Fix uninitialized variable in avl_add()

Silence the following warning when compiling with gcc 5.4.0.
Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.

module/avl/avl.c: In function ‘avl_add’:
module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized
    in this function [-Wmaybe-uninitialized]
  avl_insert(tree, new_node, where);

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Fix sync behavior for disk vdevs

Prior to b39c22b, which was first generally available in the 0.6.5
release, ZoL never actually submitted synchronous read or write
requests to the Linux block layer.  This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.

In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start().  The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b.  In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.

The original rationale for introducing synchronous operations in b39c22b
was to hurry certain requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency.  This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.

To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.

For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
is used for the same purpose.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4858

* Limit the amount of dnode metadata in the ARC

Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).

In order to help track metadata usage more precisely, the other_size
metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes.  Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
it tries to unpin 10% of the dnodes.
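
A hedged sketch of inspecting and adjusting these tunables at runtime
through the usual module parameter files (the 1 GiB value is purely
illustrative):

    $ cat /sys/module/zfs/parameters/zfs_arc_dnode_limit
    $ cat /sys/module/zfs/parameters/zfs_arc_dnode_reduce_percent
    # echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_dnode_limit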

The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):

    - Create a large number of small files until arc_meta_used exceeds
      arc_meta_limit (3GiB with default tuning) and arc_prune
      starts increasing.

    - Create a 3GiB file with dd.  Observe arc_meta_used.  It will still
      be around 3GiB.

    - Repeatedly read the 3GiB file and observe arc_meta_limit as before.
      It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made
available as subsequent demands on th…
nedbass pushed a commit to nedbass/zfs that referenced this issue Aug 26, 2016
As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliaes() method
in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes: openzfs#4726
@vladki77

vladki77 commented Nov 24, 2016

Hi guys, I have run into a similar problem with Debian 3.16.36-1+deb8u2, zfs-dkms 0.6.5.7-8-jessie.
I had to reboot as I could not access files on ZFS at all. The system itself was responsive, as / and /var are on ext4. However, rsync runs hourly and this happened for the first time only after several months, probably due to the recent reboot to get the new kernel and ZFS updates (from 0.6.5.6). Now the system is running fine (knock-knock).

I would like to ask whether you found a final solution - either by upgrading/downgrading or by tweaking some ZFS options. From reading the posts above I get the feeling that the fixes were mostly temporary and the problem came back after a while. I'm about to upgrade to jessie-backports (0.6.5.8-1~bpo8+1).

Another question is whether your problem was "deterministic", i.e. did every run of rsync cause problems, or only some of them?

Thanks in advance for any hints.
