added audio codec and touchscreen support for EDT device #2

Closed
wants to merge 0 commits into from

@a1d3s a1d3s commented Jun 5, 2015

No description provided.

@jwrdegoede jwrdegoede force-pushed the sunxi-wip branch 7 times, most recently from b3679ff to ee140d3 June 10, 2015 14:44
@a1d3s a1d3s closed this Jun 11, 2015
jwrdegoede pushed a commit that referenced this pull request Jul 28, 2015
	Normally opening a file, unlinking it and then closing will have
the inode freed upon close() (provided that it's not otherwise busy and
has no remaining links, of course).  However, there's one case where that
does *not* happen.  Namely, if you open it by fhandle with cold dcache,
then unlink() and close().

	In the normal case, d_delete() in unlink(2) notices that the dentry
is busy and unhashes it; on the final dput() it will be forcibly evicted from
dcache, triggering iput() and inode removal.  In this case, though, we end
up with *two* dentries - a disconnected one (created by open-by-fhandle) and
a regular one (used by unlink()).  The latter will have its reference to the
inode dropped just fine, but the former will not - it's considered hashed (it
is on the ->s_anon list), so it will stay around until memory pressure
finally does it in.  As a result, we have the final iput() delayed
indefinitely.  It's trivial to reproduce -

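#define _GNU_SOURCE		/* name_to_handle_at(), open_by_handle_at() */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Remounting evicts unused dentries, so the next lookup sees a cold dcache. */
void flush_dcache(void)
{
        system("mount -o remount,rw /");
}

static char buf[20 * 1024 * 1024];

int main(void)
{
        int fd;
        union {
                struct file_handle f;
                char buf[MAX_HANDLE_SZ];
        } x;
        int m;

        /* Space available for f_handle, i.e. the union minus the header. */
        x.f.handle_bytes = sizeof(x) - sizeof(x.f);
        chdir("/root");
        mkdir("foo", 0700);
        fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
        close(fd);
        name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
        flush_dcache();
        /* Opening by handle with a cold dcache creates a disconnected dentry. */
        fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
        unlink("foo/bar");
        write(fd, buf, sizeof(buf));
        system("df .");			/* 20Mb eaten */
        close(fd);
        system("df .");			/* should've freed those 20Mb */
        flush_dcache();
        system("df .");			/* should be the same as #2 */
        return 0;
}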

will spit out something like
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/root         322023 303843      1131 100% /
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/root         322023 303843      1131 100% /
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/root         322023 283282     21692  93% /
- the inode gets freed only when the dentry is finally evicted (here we
trigger that by remount; normally it would've happened in response to memory
pressure hell knows when).

Cc: stable@vger.kernel.org # v2.6.38+; earlier ones need s/kill_it/unhash_it/
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
jwrdegoede pushed a commit that referenced this pull request Jul 31, 2015
[   68.196974] WARNING: CPU: 1 PID: 2140 at arch/x86/kvm/x86.c:3161 kvm_arch_vcpu_ioctl+0xe88/0x1340 [kvm]()
[   68.196975] Modules linked in: snd_hda_codec_hdmi i915 rfcomm bnep bluetooth i2c_algo_bit rfkill nfsd drm_kms_helper nfs_acl nfs drm lockd grace sunrpc fscache snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_dummy snd_seq_oss x86_pkg_temp_thermal snd_seq_midi kvm_intel snd_seq_midi_event snd_rawmidi kvm snd_seq ghash_clmulni_intel fuse snd_timer aesni_intel parport_pc ablk_helper snd_seq_device cryptd ppdev snd lp parport lrw dcdbas gf128mul i2c_core glue_helper lpc_ich video shpchp mfd_core soundcore serio_raw acpi_cpufreq ext4 mbcache jbd2 sd_mod crc32c_intel ahci libahci libata e1000e ptp pps_core
[   68.197005] CPU: 1 PID: 2140 Comm: qemu-system-x86 Not tainted 4.2.0-rc1+ #2
[   68.197006] Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
[   68.197007]  ffffffffa03b0657 ffff8800d984bca8 ffffffff815915a2 0000000000000000
[   68.197009]  0000000000000000 ffff8800d984bce8 ffffffff81057c0a 00007ff6d0001000
[   68.197010]  0000000000000002 ffff880211c1a000 0000000000000004 ffff8800ce0288c0
[   68.197012] Call Trace:
[   68.197017]  [<ffffffff815915a2>] dump_stack+0x45/0x57
[   68.197020]  [<ffffffff81057c0a>] warn_slowpath_common+0x8a/0xc0
[   68.197022]  [<ffffffff81057cfa>] warn_slowpath_null+0x1a/0x20
[   68.197029]  [<ffffffffa037bed8>] kvm_arch_vcpu_ioctl+0xe88/0x1340 [kvm]
[   68.197035]  [<ffffffffa037aede>] ? kvm_arch_vcpu_load+0x4e/0x1c0 [kvm]
[   68.197040]  [<ffffffffa03696a6>] kvm_vcpu_ioctl+0xc6/0x5c0 [kvm]
[   68.197043]  [<ffffffff811252d2>] ? perf_pmu_enable+0x22/0x30
[   68.197044]  [<ffffffff8112663e>] ? perf_event_context_sched_in+0x7e/0xb0
[   68.197048]  [<ffffffff811a6882>] do_vfs_ioctl+0x2c2/0x4a0
[   68.197050]  [<ffffffff8107bf33>] ? finish_task_switch+0x173/0x220
[   68.197053]  [<ffffffff8123307f>] ? selinux_file_ioctl+0x4f/0xd0
[   68.197055]  [<ffffffff8122cac3>] ? security_file_ioctl+0x43/0x60
[   68.197057]  [<ffffffff811a6ad9>] SyS_ioctl+0x79/0x90
[   68.197060]  [<ffffffff81597e57>] entry_SYSCALL_64_fastpath+0x12/0x6a
[   68.197061] ---[ end trace 558a5ebf9445fc80 ]---

After commit 0c4109b ('x86/fpu/xstate: Fix up bad get_xsave_addr()
assumptions'), it is no longer assumed that an xsave bit which is present
in the hardware (pcntxt_mask) is always present in a given xsave buffer.
A state can be enabled in 'pcntxt_mask' but *not* set in 'xstate_bv'
when the last 'xsave' did not request that this feature be saved
(unlikely) or because the "init optimization" caused it to not be
saved. This patch kills the assumption.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
jwrdegoede pushed a commit that referenced this pull request Jul 31, 2015
Using the clone ioctl (or the extent_same ioctl, which calls the same extent
cloning function) we end up allowing an inline extent to be copied from
the source file into a non-zero offset of the destination file. This is
not expected, and the btrfs code is not prepared to deal with it - all
inline extents must be at a file offset equal to 0.

For example, the following excerpt of a test case for fstests triggers
a crash/BUG_ON() on a write operation after an inline extent is cloned
into a non-zero offset:

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test files. File foo has the same 2K of data at offset 4K
  # as file bar has at its offset 0.
  $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
      -c "pwrite -S 0xbb 4k 2K" \
      -c "pwrite -S 0xcc 8K 4K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # File bar consists of a single inline extent (2K size).
  $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
     $SCRATCH_MNT/bar | _filter_xfs_io

  # Now call the clone ioctl to clone the extent of file bar into file
  # foo at its offset 4K. This made file foo have an inline extent at
  # offset 4K, something which the btrfs code can not deal with in future
  # IO operations because all inline extents are supposed to start at an
  # offset of 0, resulting in all sorts of chaos.
  # So here we validate that clone ioctl returns an EOPNOTSUPP, which is
  # what it returns for other cases dealing with inlined extents.
  $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
      $SCRATCH_MNT/bar $SCRATCH_MNT/foo

  # Because of the inline extent at offset 4K, the following write made
  # the kernel crash with a BUG_ON().
  $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io

  status=0
  exit

The stack trace of the BUG_ON() triggered by the last write is:

  [152154.035903] ------------[ cut here ]------------
  [152154.036424] kernel BUG at mm/page-writeback.c:2286!
  [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$
  [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G        W       4.1.0-rc6-btrfs-next-11+ #2
  [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
  [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000
  [152154.036424] RIP: 0010:[<ffffffff8111a9d5>]  [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424] RSP: 0018:ffff880429effc68  EFLAGS: 00010246
  [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001
  [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0
  [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001
  [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68
  [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80
  [152154.036424] FS:  00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000
  [152154.036424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0
  [152154.036424] Stack:
  [152154.036424]  ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1
  [152154.036424]  ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8
  [152154.036424]  0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000
  [152154.036424] Call Trace:
  [152154.036424]  [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs]
  [152154.036424]  [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs]
  [152154.036424]  [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs]
  [152154.036424]  [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
  [152154.036424]  [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs]
  [152154.036424]  [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5
  [152154.036424]  [<ffffffff81165f89>] vfs_write+0xa0/0xe4
  [152154.036424]  [<ffffffff81166855>] SyS_pwrite64+0x64/0x82
  [152154.036424]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
  [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$
  [152154.036424] RIP  [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90
  [152154.036424]  RSP <ffff880429effc68>
  [152154.242621] ---[ end trace e3d3376b23a57041 ]---

Fix this by returning the error EOPNOTSUPP if an attempt to copy an
inline extent into a non-zero offset happens, just like what is done for
other scenarios that would require copying/splitting inline extents,
which were introduced by the following commits:

   00fdf13 ("Btrfs: fix a crash of clone with inline extents's split")
   3f9e3df ("btrfs: replace error code from btrfs_drop_extents")

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
jwrdegoede pushed a commit that referenced this pull request Jul 31, 2015
__ipoib_ib_dev_flush calls itself recursively on child devices, and lockdep
complains about locking vlan_rwsem twice (see below). Use down_read_nested
instead of down_read to prevent the warning.

 =============================================
 [ INFO: possible recursive locking detected ]
 4.1.0-rc4+ #36 Tainted: G           O
 ---------------------------------------------
 kworker/u20:2/261 is trying to acquire lock:
  (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]

 but task is already holding lock:
  (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&priv->vlan_rwsem);
   lock(&priv->vlan_rwsem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 3 locks held by kworker/u20:2/261:
  #0:  ("%s""ipoib_flush"){.+.+..}, at: [<ffffffff810827cc>] process_one_work+0x15c/0x760
  #1:  ((&priv->flush_heavy)){+.+...}, at: [<ffffffff810827cc>] process_one_work+0x15c/0x760
  #2:  (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]

 stack backtrace:
 CPU: 3 PID: 261 Comm: kworker/u20:2 Tainted: G           O    4.1.0-rc4+ #36
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
 Workqueue: ipoib_flush ipoib_ib_dev_flush_heavy [ib_ipoib]
  ffff8801c6c54790 ffff8801c9927af8 ffffffff81665238 0000000000000001
  ffffffff825b5b30 ffff8801c9927bd8 ffffffff810bba51 ffff880100000000
  ffffffff00000001 ffff880100000001 ffff8801c6c55428 ffff8801c6c54790
 Call Trace:
  [<ffffffff81665238>] dump_stack+0x4f/0x6f
  [<ffffffff810bba51>] __lock_acquire+0x741/0x1820
  [<ffffffff810bcbf8>] lock_acquire+0xc8/0x240
  [<ffffffffa0791e2a>] ? __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
  [<ffffffff81669d2c>] down_read+0x4c/0x70
  [<ffffffffa0791e2a>] ? __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
  [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
  [<ffffffffa0791e4a>] __ipoib_ib_dev_flush+0x5a/0x2b0 [ib_ipoib]
  [<ffffffffa07920ba>] ipoib_ib_dev_flush_heavy+0x1a/0x20 [ib_ipoib]
  [<ffffffff81082871>] process_one_work+0x201/0x760
  [<ffffffff810827cc>] ? process_one_work+0x15c/0x760
  [<ffffffff81082ef0>] worker_thread+0x120/0x4d0
  [<ffffffff81082dd0>] ? process_one_work+0x760/0x760
  [<ffffffff81082dd0>] ? process_one_work+0x760/0x760
  [<ffffffff81088b7e>] kthread+0xfe/0x120
  [<ffffffff81088a80>] ? __init_kthread_worker+0x70/0x70
  [<ffffffff8166c6e2>] ret_from_fork+0x42/0x70
  [<ffffffff81088a80>] ? __init_kthread_worker+0x70/0x70

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
jwrdegoede pushed a commit that referenced this pull request Jul 31, 2015
Kirill A. Shutemov says:

This simple test-case triggers a few locking asserts in the kernel:

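#include <stdlib.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/netlink.h>	/* struct nl_mmap_req, NETLINK_RX_RING, NETLINK_TX_RING */

#ifndef SOL_NETLINK
#define SOL_NETLINK 270		/* fallback for C libraries that do not define it */
#endif

int main(int argc, char **argv)
{
        unsigned int block_size = 16 * 4096;
        struct nl_mmap_req req = {
                .nm_block_size          = block_size,
                .nm_block_nr            = 64,
                .nm_frame_size          = 16384,
                .nm_frame_nr            = 64 * block_size / 16384,
        };
        unsigned int ring_size;
        int fd;

        fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
        /* Set up both rings with the same geometry. */
        if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
                exit(1);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
                exit(1);

        /* Map RX and TX rings back to back, then exit without unmapping. */
        ring_size = req.nm_block_nr * req.nm_block_size;
        mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        return 0;
}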

+++ exited with 0 +++
BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
3 locks held by init/1:
 #0:  (reboot_mutex){+.+...}, at: [<ffffffff81080959>] SyS_reboot+0xa9/0x220
 #1:  ((reboot_notifier_list).rwsem){.+.+..}, at: [<ffffffff8107f379>] __blocking_notifier_call_chain+0x39/0x70
 #2:  (rcu_callback){......}, at: [<ffffffff810d32e0>] rcu_do_batch.isra.49+0x160/0x10c0
Preemption disabled at:[<ffffffff8145365f>] __delay+0xf/0x20

CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
 ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
 0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
 ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
Call Trace:
 <IRQ>  [<ffffffff81929ceb>] dump_stack+0x4f/0x7b
 [<ffffffff81085a9d>] ___might_sleep+0x16d/0x270
 [<ffffffff81085bed>] __might_sleep+0x4d/0x90
 [<ffffffff8192e96f>] mutex_lock_nested+0x2f/0x430
 [<ffffffff81932fed>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
 [<ffffffff81464143>] ? __this_cpu_preempt_check+0x13/0x20
 [<ffffffff8182fc3d>] netlink_set_ring+0x1ed/0x350
 [<ffffffff8182e000>] ? netlink_undo_bind+0x70/0x70
 [<ffffffff8182fe20>] netlink_sock_destruct+0x80/0x150
 [<ffffffff817e484d>] __sk_free+0x1d/0x160
 [<ffffffff817e49a9>] sk_free+0x19/0x20
[..]

Cong Wang says:

We can't hold mutex lock in a rcu callback, [..]

Thomas Graf says:

The socket should be dead at this point. It might be simpler to
add a netlink_release_ring() function which doesn't require
locking at all.

Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Diagnosed-by: Cong Wang <cwang@twopensource.com>
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
jwrdegoede pushed a commit that referenced this pull request Aug 14, 2015
The probe error path for this driver, for all intents and purposes,
is the talitos_remove() function due to the common "goto err_out".

Without this patch applied, talitos_remove() will panic under these
two conditions:

1. If the RNG device hasn't been registered via
   talitos_register_rng() prior to entry into talitos_remove(),
   then the attempt to unregister the RNG "device" will cause a panic.

2. If the priv->chan array has not been allocated prior to entry
   into talitos_remove(), then the per-channel FIFO cleanup will panic
   because of the dereference of that NULL "array".

Both of the above scenarios occur if talitos_probe_irq() fails.

This patch resolves issue #1 by introducing a boolean to mask the
hwrng_unregister() call in talitos_unregister_rng() if RNG device
registration was unsuccessful.

It resolves issue #2 by checking that priv->chan is not NULL in the
per-channel FIFO cleanup for loop.

Signed-off-by: Aaron Sierra <asierra@xes-inc.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
jwrdegoede pushed a commit that referenced this pull request Aug 23, 2015
In dev_queue_xmit(), net_cls is protected with rcu-bh.

[  270.730026] ===============================
[  270.730029] [ INFO: suspicious RCU usage. ]
[  270.730033] 4.2.0-rc3+ #2 Not tainted
[  270.730036] -------------------------------
[  270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage!
[  270.730041] other info that might help us debug this:
[  270.730043] rcu_scheduler_active = 1, debug_locks = 1
[  270.730045] 2 locks held by dhclient/748:
[  270.730046]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff81682b70>] __dev_queue_xmit+0x50/0x960
[  270.730085]  #1:  (&qdisc_tx_lock){+.....}, at: [<ffffffff81682d60>] __dev_queue_xmit+0x240/0x960
[  270.730090] stack backtrace:
[  270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2
[  270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[  270.730100]  0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007
[  270.730103]  ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00
[  270.730105]  ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8
[  270.730108] Call Trace:
[  270.730121]  [<ffffffff817ad487>] dump_stack+0x4c/0x65
[  270.730148]  [<ffffffff810ca4f2>] lockdep_rcu_suspicious+0xe2/0x120
[  270.730153]  [<ffffffff816a62d2>] task_cls_state+0x92/0xa0
[  270.730158]  [<ffffffffa00b534f>] cls_cgroup_classify+0x4f/0x120 [cls_cgroup]
[  270.730164]  [<ffffffff816aac74>] tc_classify_compat+0x74/0xc0
[  270.730166]  [<ffffffff816ab573>] tc_classify+0x33/0x90
[  270.730170]  [<ffffffffa00bcb0a>] htb_enqueue+0xaa/0x4a0 [sch_htb]
[  270.730172]  [<ffffffff81682e26>] __dev_queue_xmit+0x306/0x960
[  270.730174]  [<ffffffff81682b70>] ? __dev_queue_xmit+0x50/0x960
[  270.730176]  [<ffffffff816834a3>] dev_queue_xmit_sk+0x13/0x20
[  270.730185]  [<ffffffff81787770>] dev_queue_xmit+0x10/0x20
[  270.730187]  [<ffffffff8178b91c>] packet_snd.isra.62+0x54c/0x760
[  270.730190]  [<ffffffff8178be25>] packet_sendmsg+0x2f5/0x3f0
[  270.730203]  [<ffffffff81665245>] ? sock_def_readable+0x5/0x190
[  270.730210]  [<ffffffff817b64bb>] ? _raw_spin_unlock+0x2b/0x40
[  270.730216]  [<ffffffff8173bcbc>] ? unix_dgram_sendmsg+0x5cc/0x640
[  270.730219]  [<ffffffff8165f367>] sock_sendmsg+0x47/0x50
[  270.730221]  [<ffffffff8165f42f>] sock_write_iter+0x7f/0xd0
[  270.730232]  [<ffffffff811fd4c7>] __vfs_write+0xa7/0xf0
[  270.730234]  [<ffffffff811fe5b8>] vfs_write+0xb8/0x190
[  270.730236]  [<ffffffff811fe8c2>] SyS_write+0x52/0xb0
[  270.730239]  [<ffffffff817b6bae>] entry_SYSCALL_64_fastpath+0x12/0x76

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
jwrdegoede pushed a commit that referenced this pull request Aug 23, 2015
The cgroup writeback code attaches inode->i_wb via mark_inode_dirty, and when
set_page_writeback is called, __inc_wb_stat() updates i_wb's stat.

So we need to explicitly call set_page_dirty->__mark_inode_dirty prior to
writing back any pages.

This patch should resolve the following kernel panic reported by Andreas Reis.

https://bugzilla.kernel.org/show_bug.cgi?id=101801

--- Comment #2 from Andreas Reis <andreas.reis@gmail.com> ---
BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
IP: [<ffffffff8149deea>] __percpu_counter_add+0x1a/0x90
PGD 2951ff067 PUD 2df43f067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in:
CPU: 7 PID: 10356 Comm: gcc Tainted: G        W       4.2.0-1-cu #1
Hardware name: Gigabyte Technology Co., Ltd. G1.Sniper M5/G1.Sniper M5, BIOS
T01 02/03/2015
task: ffff880295044f80 ti: ffff880295140000 task.ti: ffff880295140000
RIP: 0010:[<ffffffff8149deea>]  [<ffffffff8149deea>]
__percpu_counter_add+0x1a/0x90
RSP: 0018:ffff880295143ac8  EFLAGS: 00010082
RAX: 0000000000000003 RBX: ffffea000a526d40 RCX: 0000000000000001
RDX: 0000000000000020 RSI: 0000000000000001 RDI: 0000000000000088
RBP: ffff880295143ae8 R08: 0000000000000000 R09: ffff88008f69bb30
R10: 00000000fffffffa R11: 0000000000000000 R12: 0000000000000088
R13: 0000000000000001 R14: ffff88041d099000 R15: ffff880084a205d0
FS:  00007f8549374700(0000) GS:ffff88042f3c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000a8 CR3: 000000033e1d5000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 0000000000000000 ffffea000a526d40 ffff880084a20738 ffff880084a20750
 ffff880295143b48 ffffffff811cc91e ffff880000000000 0000000000000296
 0000000000000000 ffff880417090198 0000000000000000 ffffea000a526d40
Call Trace:
 [<ffffffff811cc91e>] __test_set_page_writeback+0xde/0x1d0
 [<ffffffff813fee87>] do_write_data_page+0xe7/0x3a0
 [<ffffffff813faeea>] gc_data_segment+0x5aa/0x640
 [<ffffffff813fb0b8>] do_garbage_collect+0x138/0x150
 [<ffffffff813fb3fe>] f2fs_gc+0x1be/0x3e0
 [<ffffffff81405541>] f2fs_balance_fs+0x81/0x90
 [<ffffffff813ee357>] f2fs_unlink+0x47/0x1d0
 [<ffffffff81239329>] vfs_unlink+0x109/0x1b0
 [<ffffffff8123e3d7>] do_unlinkat+0x287/0x2c0
 [<ffffffff8123ebc6>] SyS_unlink+0x16/0x20
 [<ffffffff81942e2e>] entry_SYSCALL_64_fastpath+0x12/0x71
Code: 41 5e 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 49
89 f5 41 54 49 89 fc 53 48 83 ec 08 65 ff 05 e6 d9 b6 7e <48> 8b 47 20 48 63 ca
65 8b 18 48 63 db 48 01 f3 48 39 cb 7d 0a
RIP  [<ffffffff8149deea>] __percpu_counter_add+0x1a/0x90
 RSP <ffff880295143ac8>
CR2: 00000000000000a8
---[ end trace 5132449a58ed93a3 ]---
note: gcc[10356] exited with preempt_count 2

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Aug 23, 2015
Hayes Wang says:

====================
r8152: issues fix

v2:
Replace patch #2 with "r8152: fix wakeup settings".

v1:
These patches are used to fix issues.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
jwrdegoede pushed a commit that referenced this pull request Aug 23, 2015
Hayes Wang says:

====================
r8152: device reset

v3:
For patch #2, remove cancel_delayed_work().

v2:
For patch #1, remove usb_autopm_get_interface(), usb_autopm_put_interface(), and
the checking of intf->condition.

For patch #2, replace the original method with usb_queue_reset_device() to reset
the device.

v1:
Although the driver works normally, we find the device may get all 0xff data when
transmitting packets on certain platforms. It would break the device and no packet
could be transmitted. The reset is necessary to recover the hw for this situation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
jwrdegoede pushed a commit that referenced this pull request Aug 23, 2015
Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  #1 schedule at ffffffff815ab76e
  #2 schedule_timeout at ffffffff815ae5e5
  #3 io_schedule_timeout at ffffffff815aad6a
  #4 bit_wait_io at ffffffff815abfc6
  #5 __wait_on_bit at ffffffff815abda5
  #6 wait_on_page_bit at ffffffff8111fd4f
  #7 shrink_page_list at ffffffff81135445
  #8 shrink_inactive_list at ffffffff81135845
  #9 shrink_lruvec at ffffffff81135ead
 #10 shrink_zone at ffffffff811360c3
 #11 shrink_zones at ffffffff81136eff
 #12 do_try_to_free_pages at ffffffff8113712f
 #13 try_to_free_mem_cgroup_pages at ffffffff811372be
 #14 try_charge at ffffffff81189423
 #15 mem_cgroup_try_charge at ffffffff8118c6f5
 #16 __add_to_page_cache_locked at ffffffff8112137d
 #17 add_to_page_cache_lru at ffffffff81121618
 #18 pagecache_get_page at ffffffff8112170b
 #19 grow_dev_page at ffffffff811c8297
 #20 __getblk_slow at ffffffff811c91d6
 #21 __getblk_gfp at ffffffff811c92c1
 #22 ext4_ext_grow_indepth at ffffffff8124565c
 #23 ext4_ext_create_new_leaf at ffffffff81246ca8
 #24 ext4_ext_insert_extent at ffffffff81246f09
 #25 ext4_ext_map_blocks at ffffffff8124a848
 #26 ext4_map_blocks at ffffffff8121a5b7
 #27 mpage_map_one_extent at ffffffff8121b1fa
 #28 mpage_map_and_submit_extent at ffffffff8121f07b
 #29 ext4_writepages at ffffffff8121f6d5
 #30 do_writepages at ffffffff8112c490
 #31 __filemap_fdatawrite_range at ffffffff81120199
 #32 filemap_flush at ffffffff8112041c
 #33 ext4_alloc_da_blocks at ffffffff81219da1
 #34 ext4_rename at ffffffff81229b91
 #35 ext4_rename2 at ffffffff81229e32
 #36 vfs_rename at ffffffff811a08a5
 #37 SYSC_renameat2 at ffffffff811a3ffc
 #38 sys_renameat2 at ffffffff811a408e
 #39 sys_rename at ffffffff8119e51e
 #40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.

According to David Chinner, xfs has been doing a similar thing since 2.6.15,
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: stable@vger.kernel.org # 3.9+
[tytso@mit.edu: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jwrdegoede pushed a commit that referenced this pull request Aug 23, 2015
The shm implementation internally uses shmem or hugetlbfs inodes for shm
segments.  As these inodes are never directly exposed to userspace and
only accessed through the shm operations which are already hooked by
security modules, mark the inodes with the S_PRIVATE flag so that inode
security initialization and permission checking is skipped.

This was motivated by the following lockdep warning:

  ======================================================
   [ INFO: possible circular locking dependency detected ]
   4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G        W
  -------------------------------------------------------
   httpd/1597 is trying to acquire lock:
   (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
   but task is already holding lock:
   (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
   which lock already depends on the new lock.
   the existing dependency chain (in reverse order) is:
   -> #3 (&mm->mmap_sem){++++++}:
        lock_acquire+0xc7/0x270
        __might_fault+0x7a/0xa0
        filldir+0x9e/0x130
        xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
        xfs_readdir+0x1b4/0x330 [xfs]
        xfs_file_readdir+0x2b/0x30 [xfs]
        iterate_dir+0x97/0x130
        SyS_getdents+0x91/0x120
        entry_SYSCALL_64_fastpath+0x12/0x76
   -> #2 (&xfs_dir_ilock_class){++++.+}:
        lock_acquire+0xc7/0x270
        down_read_nested+0x57/0xa0
        xfs_ilock+0x167/0x350 [xfs]
        xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
        xfs_attr_get+0xbd/0x190 [xfs]
        xfs_xattr_get+0x3d/0x70 [xfs]
        generic_getxattr+0x4f/0x70
        inode_doinit_with_dentry+0x162/0x670
        sb_finish_set_opts+0xd9/0x230
        selinux_set_mnt_opts+0x35c/0x660
        superblock_doinit+0x77/0xf0
        delayed_superblock_init+0x10/0x20
        iterate_supers+0xb3/0x110
        selinux_complete_init+0x2f/0x40
        security_load_policy+0x103/0x600
        sel_write_load+0xc1/0x750
        __vfs_write+0x37/0x100
        vfs_write+0xa9/0x1a0
        SyS_write+0x58/0xd0
        entry_SYSCALL_64_fastpath+0x12/0x76
  ...

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Reported-by: Morten Stevens <mstevens@fedoraproject.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jwrdegoede pushed a commit that referenced this pull request Aug 23, 2015
It turns out that a PV domU also requires the "Xen PV" APIC
driver. Otherwise, the flat driver is used and we get stuck in busy
loops that never exit, such as in this stack trace:

(gdb) target remote localhost:9999
Remote debugging using localhost:9999
__xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56
56              while (native_apic_mem_read(APIC_ICR) & APIC_ICR_BUSY)
(gdb) bt
 #0  __xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56
 #1  __default_send_IPI_shortcut (shortcut=<optimized out>,
dest=<optimized out>, vector=<optimized out>) at
./arch/x86/include/asm/ipi.h:75
 #2  apic_send_IPI_self (vector=246) at arch/x86/kernel/apic/probe_64.c:54
 #3  0xffffffff81011336 in arch_irq_work_raise () at
arch/x86/kernel/irq_work.c:47
 #4  0xffffffff8114990c in irq_work_queue (work=0xffff88000fc0e400) at
kernel/irq_work.c:100
 #5  0xffffffff8110c29d in wake_up_klogd () at kernel/printk/printk.c:2633
 #6  0xffffffff8110ca60 in vprintk_emit (facility=0, level=<optimized
out>, dict=0x0 <irq_stack_union>, dictlen=<optimized out>,
fmt=<optimized out>, args=<optimized out>)
    at kernel/printk/printk.c:1778
 #7  0xffffffff816010c8 in printk (fmt=<optimized out>) at
kernel/printk/printk.c:1868
 #8  0xffffffffc00013ea in ?? ()
 #9  0x0000000000000000 in ?? ()

Mailing-list-thread: https://lkml.org/lkml/2015/8/4/755
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
jwrdegoede pushed a commit that referenced this pull request Sep 29, 2015
Kernel testing triggered this warning:

| WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80()
| Modules linked in:
| CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c7 #2
| Call Trace:
|   dump_stack+0x4b/0x75
|   warn_slowpath_common+0x8b/0xc0
|   warn_slowpath_null+0x22/0x30
|   do_set_cpus_allowed+0x7e/0x80
|   cpuset_cpus_allowed_fallback+0x7c/0x170
|   select_fallback_rq+0x221/0x280
|   migration_call+0xe3/0x250
|   notifier_call_chain+0x53/0x70
|   __raw_notifier_call_chain+0x1e/0x30
|   cpu_notify+0x28/0x50
|   take_cpu_down+0x22/0x40
|   multi_cpu_stop+0xd5/0x140
|   cpu_stopper_thread+0xbc/0x170
|   smpboot_thread_fn+0x174/0x2f0
|   kthread+0xc4/0xe0
|   ret_from_kernel_thread+0x21/0x30

As Peterz pointed out:

| So the normal rules for changing task_struct::cpus_allowed are holding
| both pi_lock and rq->lock, such that holding either stabilizes the mask.
|
| This is so that wakeup can happen without rq->lock and load-balance
| without pi_lock.
|
| From this we already get the relaxation that we can omit acquiring
| rq->lock if the task is not on the rq, because in that case
| load-balancing will not apply to it.
|
| ** these are the rules currently tested in do_set_cpus_allowed() **
|
| Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which
| unconditionally acquires both locks, we could get away with holding just
| rq->lock when on_rq for modification because that'd still exclude
| __set_cpus_allowed_ptr(), it would also work against
| __kthread_bind_mask() because that assumes !on_rq.
|
| That said, this is all somewhat fragile.
|
| Now, I don't think dropping rq->lock is quite as disastrous as it
| usually is because !cpu_active at this point, which means load-balance
| will not interfere, but that too is somewhat fragile.
|
| So we end up with a choice of two fragile..

This patch fixes it by following the rules for changing
task_struct::cpus_allowed with both pi_lock and rq->lock held.
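
The rule quoted above (writers hold both locks, so a reader holding either
one sees a stable mask) can be illustrated with a small userspace analogue.
This is only a sketch under stated assumptions: struct task, task_init(),
set_allowed() and the two read_allowed_*() helpers are hypothetical names,
and pthread mutexes stand in for pi_lock and rq->lock. It is not the
scheduler code touched by this patch.

```c
#include <pthread.h>

/* Hypothetical userspace stand-in for a task with an affinity mask. */
struct task {
	pthread_mutex_t pi_lock;	/* stands in for task_struct::pi_lock */
	pthread_mutex_t rq_lock;	/* stands in for rq->lock */
	unsigned long cpus_allowed;
};

static void task_init(struct task *t, unsigned long mask)
{
	pthread_mutex_init(&t->pi_lock, NULL);
	pthread_mutex_init(&t->rq_lock, NULL);
	t->cpus_allowed = mask;
}

/* Writer: takes both locks, always in the same order. */
static void set_allowed(struct task *t, unsigned long mask)
{
	pthread_mutex_lock(&t->pi_lock);
	pthread_mutex_lock(&t->rq_lock);
	t->cpus_allowed = mask;
	pthread_mutex_unlock(&t->rq_lock);
	pthread_mutex_unlock(&t->pi_lock);
}

/* Wakeup-like reader: pi_lock alone is enough to stabilize the mask. */
static unsigned long read_allowed_wakeup(struct task *t)
{
	unsigned long mask;

	pthread_mutex_lock(&t->pi_lock);
	mask = t->cpus_allowed;
	pthread_mutex_unlock(&t->pi_lock);
	return mask;
}

/* Load-balance-like reader: rq_lock alone is enough as well. */
static unsigned long read_allowed_balance(struct task *t)
{
	unsigned long mask;

	pthread_mutex_lock(&t->rq_lock);
	mask = t->cpus_allowed;
	pthread_mutex_unlock(&t->rq_lock);
	return mask;
}
```

Because every writer excludes both kinds of reader, neither reader path needs
to know about the other lock.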

Reported-by: kernel test robot <ying.huang@intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[ Modified changelog and patch. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/BLU436-SMTP1660820490DE202E3934ED3806E0@phx.gbl
Signed-off-by: Ingo Molnar <mingo@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Sep 29, 2015
The renesas-irqc interrupt controller is cascaded to the GIC. Hence when
propagating wake-up settings to its parent interrupt controller, the
following lockdep warning is printed:

    =============================================
    [ INFO: possible recursive locking detected ]
    4.2.0-ape6evm-10725-g50fcd7643c034198 #280 Not tainted
    ---------------------------------------------
    s2ram/1072 is trying to acquire lock:
    (&irq_desc_lock_class){-.-...}, at: [<c008d3fc>] __irq_get_desc_lock+0x58/0x98

    but task is already holding lock:
    (&irq_desc_lock_class){-.-...}, at: [<c008d3fc>] __irq_get_desc_lock+0x58/0x98

    other info that might help us debug this:
    Possible unsafe locking scenario:

	  CPU0
	  ----
     lock(&irq_desc_lock_class);
     lock(&irq_desc_lock_class);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    6 locks held by s2ram/1072:
    #0:  (sb_writers#7){.+.+.+}, at: [<c012eb14>] __sb_start_write+0xa0/0xa8
    #1:  (&of->mutex){+.+.+.}, at: [<c019396c>] kernfs_fop_write+0x4c/0x1bc
    #2:  (s_active#24){.+.+.+}, at: [<c0193974>] kernfs_fop_write+0x54/0x1bc
    #3:  (pm_mutex){+.+.+.}, at: [<c008213c>] pm_suspend+0x10c/0x510
    #4:  (&dev->mutex){......}, at: [<c02af3c4>] __device_suspend+0xdc/0x2cc
    #5:  (&irq_desc_lock_class){-.-...}, at: [<c008d3fc>] __irq_get_desc_lock+0x58/0x98

    stack backtrace:
    CPU: 0 PID: 1072 Comm: s2ram Not tainted 4.2.0-ape6evm-10725-g50fcd7643c034198 #280
    Hardware name: Generic R8A73A4 (Flattened Device Tree)
    [<c0018078>] (unwind_backtrace) from [<c00144f0>] (show_stack+0x10/0x14)
    [<c00144f0>] (show_stack) from [<c0451f14>] (dump_stack+0x88/0x98)
    [<c0451f14>] (dump_stack) from [<c007b29c>] (__lock_acquire+0x15cc/0x20e4)
    [<c007b29c>] (__lock_acquire) from [<c007c6e0>] (lock_acquire+0xac/0x12c)
    [<c007c6e0>] (lock_acquire) from [<c0457c00>] (_raw_spin_lock_irqsave+0x40/0x54)
    [<c0457c00>] (_raw_spin_lock_irqsave) from [<c008d3fc>] (__irq_get_desc_lock+0x58/0x98)
    [<c008d3fc>] (__irq_get_desc_lock) from [<c008ebbc>] (irq_set_irq_wake+0x20/0xf8)
    [<c008ebbc>] (irq_set_irq_wake) from [<c0260770>] (irqc_irq_set_wake+0x20/0x4c)
    [<c0260770>] (irqc_irq_set_wake) from [<c008ec28>] (irq_set_irq_wake+0x8c/0xf8)
    [<c008ec28>] (irq_set_irq_wake) from [<c02cb8c0>] (gpio_keys_suspend+0x74/0xc0)
    [<c02cb8c0>] (gpio_keys_suspend) from [<c02ae8cc>] (dpm_run_callback+0x54/0x124)

Avoid this false positive by using a separate lockdep class for IRQC
interrupts.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/1441798974-25716-2-git-send-email-geert%2Brenesas@glider.be
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
jwrdegoede pushed a commit that referenced this pull request Sep 29, 2015
The renesas-intc-irqpin interrupt controller is cascaded to the GIC.
Hence when propagating wake-up settings to its parent interrupt
controller, the following lockdep warning is printed:

    =============================================
    [ INFO: possible recursive locking detected ]
    4.2.0-armadillo-10725-g50fcd7643c034198 #781 Not tainted
    ---------------------------------------------
    s2ram/1179 is trying to acquire lock:
    (&irq_desc_lock_class){-.-...}, at: [<c005bb54>] __irq_get_desc_lock+0x78/0x94

    but task is already holding lock:
    (&irq_desc_lock_class){-.-...}, at: [<c005bb54>] __irq_get_desc_lock+0x78/0x94

    other info that might help us debug this:
    Possible unsafe locking scenario:

	  CPU0
	  ----
     lock(&irq_desc_lock_class);
     lock(&irq_desc_lock_class);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    7 locks held by s2ram/1179:
    #0:  (sb_writers#7){.+.+.+}, at: [<c00c9708>] __sb_start_write+0x64/0xb8
    #1:  (&of->mutex){+.+.+.}, at: [<c0125a00>] kernfs_fop_write+0x78/0x1a0
    #2:  (s_active#23){.+.+.+}, at: [<c0125a08>] kernfs_fop_write+0x80/0x1a0
    #3:  (autosleep_lock){+.+.+.}, at: [<c0058244>] pm_autosleep_lock+0x18/0x20
    #4:  (pm_mutex){+.+.+.}, at: [<c0057e50>] pm_suspend+0x54/0x248
    #5:  (&dev->mutex){......}, at: [<c0243a20>] __device_suspend+0xdc/0x240
    #6:  (&irq_desc_lock_class){-.-...}, at: [<c005bb54>] __irq_get_desc_lock+0x78/0x94

    stack backtrace:
    CPU: 0 PID: 1179 Comm: s2ram Not tainted 4.2.0-armadillo-10725-g50fcd7643c034198

    Hardware name: Generic R8A7740 (Flattened Device Tree)
    [<c00129f4>] (dump_backtrace) from [<c0012bec>] (show_stack+0x18/0x1c)
    [<c0012bd4>] (show_stack) from [<c03f5d94>] (dump_stack+0x20/0x28)
    [<c03f5d74>] (dump_stack) from [<c00514d4>] (__lock_acquire+0x67c/0x1b88)
    [<c0050e58>] (__lock_acquire) from [<c0052df8>] (lock_acquire+0x9c/0xbc)
    [<c0052d5c>] (lock_acquire) from [<c03fb068>] (_raw_spin_lock_irqsave+0x44/0x58)
    [<c03fb024>] (_raw_spin_lock_irqsave) from [<c005bb54>] (__irq_get_desc_lock+0x78/0x94)
    [<c005badc>] (__irq_get_desc_lock) from [<c005c3d8>] (irq_set_irq_wake+0x28/0x100)
    [<c005c3b0>] (irq_set_irq_wake) from [<c01e50d0>] (intc_irqpin_irq_set_wake+0x24/0x4c)
    [<c01e50ac>] (intc_irqpin_irq_set_wake) from [<c005c17c>] (set_irq_wake_real+0x3c/0x50)
    [<c005c140>] (set_irq_wake_real) from [<c005c414>] (irq_set_irq_wake+0x64/0x100)
    [<c005c3b0>] (irq_set_irq_wake) from [<c02a19b4>] (gpio_keys_suspend+0x60/0xa0)
    [<c02a1954>] (gpio_keys_suspend) from [<c023b750>] (platform_pm_suspend+0x3c/0x5c)

Avoid this false positive by using a separate lockdep class for INTC
External IRQ Pin interrupts.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/1441798974-25716-3-git-send-email-geert%2Brenesas@glider.be
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
jwrdegoede pushed a commit that referenced this pull request Sep 29, 2015
The OPP list needs to be protected against concurrent accesses. Using
simple RCU read locks does the trick and gets rid of the following
lockdep warning:

	===============================
	[ INFO: suspicious RCU usage. ]
	4.2.0-next-20150908 #1 Not tainted
	-------------------------------
	drivers/base/power/opp.c:460 Missing rcu_read_lock() or dev_opp_list_lock protection!

	other info that might help us debug this:

	rcu_scheduler_active = 1, debug_locks = 0
	4 locks held by kworker/u8:0/6:
	 #0:  ("%s""deferwq"){++++.+}, at: [<c0040d8c>] process_one_work+0x118/0x4bc
	 #1:  (deferred_probe_work){+.+.+.}, at: [<c0040d8c>] process_one_work+0x118/0x4bc
	 #2:  (&dev->mutex){......}, at: [<c03b8194>] __device_attach+0x20/0x118
	 #3:  (prepare_lock){+.+...}, at: [<c054bc08>] clk_prepare_lock+0x10/0xf8

	stack backtrace:
	CPU: 2 PID: 6 Comm: kworker/u8:0 Not tainted 4.2.0-next-20150908 #1
	Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
	Workqueue: deferwq deferred_probe_work_func
	[<c001802c>] (unwind_backtrace) from [<c00135a4>] (show_stack+0x10/0x14)
	[<c00135a4>] (show_stack) from [<c02a8418>] (dump_stack+0x94/0xd4)
	[<c02a8418>] (dump_stack) from [<c03c6f6c>] (dev_pm_opp_find_freq_ceil+0x108/0x114)
	[<c03c6f6c>] (dev_pm_opp_find_freq_ceil) from [<c0551a3c>] (dfll_calculate_rate_request+0xb8/0x170)
	[<c0551a3c>] (dfll_calculate_rate_request) from [<c0551b10>] (dfll_clk_round_rate+0x1c/0x2c)
	[<c0551b10>] (dfll_clk_round_rate) from [<c054de2c>] (clk_calc_new_rates+0x1b8/0x228)
	[<c054de2c>] (clk_calc_new_rates) from [<c054e44c>] (clk_core_set_rate_nolock+0x44/0xac)
	[<c054e44c>] (clk_core_set_rate_nolock) from [<c054e4d8>] (clk_set_rate+0x24/0x34)
	[<c054e4d8>] (clk_set_rate) from [<c0512460>] (tegra124_cpufreq_probe+0x120/0x230)
	[<c0512460>] (tegra124_cpufreq_probe) from [<c03b9cbc>] (platform_drv_probe+0x44/0xac)
	[<c03b9cbc>] (platform_drv_probe) from [<c03b84c8>] (driver_probe_device+0x218/0x304)
	[<c03b84c8>] (driver_probe_device) from [<c03b69b0>] (bus_for_each_drv+0x60/0x94)
	[<c03b69b0>] (bus_for_each_drv) from [<c03b8228>] (__device_attach+0xb4/0x118)
	ata1: SATA link down (SStatus 0 SControl 300)
	[<c03b8228>] (__device_attach) from [<c03b77c8>] (bus_probe_device+0x88/0x90)
	[<c03b77c8>] (bus_probe_device) from [<c03b7be8>] (deferred_probe_work_func+0x58/0x8c)
	[<c03b7be8>] (deferred_probe_work_func) from [<c0040dfc>] (process_one_work+0x188/0x4bc)
	[<c0040dfc>] (process_one_work) from [<c004117c>] (worker_thread+0x4c/0x4f4)
	[<c004117c>] (worker_thread) from [<c0047230>] (kthread+0xe4/0xf8)
	[<c0047230>] (kthread) from [<c000f7d0>] (ret_from_fork+0x14/0x24)

Signed-off-by: Thierry Reding <treding@nvidia.com>
Fixes: c4fe70a ("clk: tegra: Add closed loop support for the DFLL")
[vince.h@nvidia.com: Unlock rcu on error path]
Signed-off-by: Vince Hsu <vince.h@nvidia.com>
[sboyd@codeaurora.org: Dropped second hunk that nested the rcu
read lock unnecessarily]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
jwrdegoede pushed a commit that referenced this pull request Oct 13, 2015
…t initialized.

In case something goes wrong with power well initialization, we were calling
intel_prepare_ddi during boot while the encoder list isn't initialized.

[    9.618747] i915 0000:00:02.0: Invalid ROM contents
[    9.631446] [drm] failed to find VBIOS tables
[    9.720036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[    9.721986] IP: [<ffffffffa014eb72>] ddi_get_encoder_port+0x82/0x190 [i915]
[    9.723736] PGD 0
[    9.724286] Oops: 0000 [#1] PREEMPT SMP
[    9.725386] Modules linked in: intel_powerclamp snd_hda_intel(+) coretemp crc32c_intel snd_hda_codec snd_hda_core serio_raw snd_pcm snd_timer i915(+) parport_pc parport pinctrl_sunrisepoint pinctrl_intel nfsd nfs_acl
[    9.730635] CPU: 0 PID: 497 Comm: systemd-udevd Not tainted 4.3.0-rc2-eywa-10967-g72de2cfd-dirty #2
[    9.732785] Hardware name: Intel Corporation Cannonlake Client platform/Skylake DT DDR4 RVP8, BIOS CNLSE2R1.R00.X021.B00.1508040310 08/04/2015
[    9.735785] task: ffff88008a704700 ti: ffff88016a1ac000 task.ti: ffff88016a1ac000
[    9.737584] RIP: 0010:[<ffffffffa014eb72>]  [<ffffffffa014eb72>] ddi_get_encoder_port+0x82/0x190 [i915]
[    9.739934] RSP: 0000:ffff88016a1af710  EFLAGS: 00010296
[    9.741184] RAX: 000000000000004e RBX: ffff88008a9edc98 RCX: 0000000000000001
[    9.742934] RDX: 000000000000004e RSI: ffffffff81fc1e82 RDI: 00000000ffffffff
[    9.744634] RBP: ffff88016a1af730 R08: 0000000000000000 R09: 0000000000000578
[    9.746333] R10: 0000000000001065 R11: 0000000000000578 R12: fffffffffffffff8
[    9.748033] R13: ffff88016a1af7a8 R14: ffff88016a1af794 R15: 0000000000000000
[    9.749733] FS:  00007eff2e1e07c0(0000) GS:ffff88016fc00000(0000) knlGS:0000000000000000
[    9.751683] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.753083] CR2: 0000000000000058 CR3: 000000016922b000 CR4: 00000000003406f0
[    9.754782] Stack:
[    9.755332]  ffff88008a9edc98 ffff88008a9ed800 ffffffffa01d07b0 00000000fffb909e
[    9.757232]  ffff88016a1af7d8 ffffffffa0154ea7 0000000000000246 ffff88016a370080
[    9.759182]  ffff88016a370080 ffff88008a9ed800 0000000000000246 ffff88008a9edc98
[    9.761132] Call Trace:
[    9.761782]  [<ffffffffa0154ea7>] intel_prepare_ddi+0x67/0x860 [i915]
[    9.763332]  [<ffffffff81a56996>] ? _raw_spin_unlock_irqrestore+0x26/0x40
[    9.765031]  [<ffffffffa00fad01>] ? gen9_read32+0x141/0x360 [i915]
[    9.766531]  [<ffffffffa00b43e1>] skl_set_power_well+0x431/0xa80 [i915]
[    9.768181]  [<ffffffffa00b4a63>] skl_power_well_enable+0x13/0x20 [i915]
[    9.769781]  [<ffffffffa00b2188>] intel_power_well_enable+0x28/0x50 [i915]
[    9.771481]  [<ffffffffa00b4d52>] intel_display_power_get+0x92/0xc0 [i915]
[    9.773180]  [<ffffffffa00b4fcb>] intel_display_set_init_power+0x3b/0x40 [i915]
[    9.774980]  [<ffffffffa00b5170>] intel_power_domains_init_hw+0x120/0x520 [i915]
[    9.776780]  [<ffffffffa0194c61>] i915_driver_load+0xb21/0xf40 [i915]

So let's protect this case.

My first attempt was to remove the intel_prepare_ddi call, but Daniel pointed
out that it is really needed to restore those register values. Imre then pointed
out that this call was made without the flag protection, and that this was
actually where things went bad. I've checked, and this indeed solves my issue.

The regressing intel_prepare_ddi call was added in

commit 1d2b952
Author: Damien Lespiau <damien.lespiau@intel.com>
Date:   Fri Mar 6 18:50:53 2015 +0000

    drm/i915/skl: Restore the DDI translation tables when enabling PW1

Cc: Imre Deak <imre.deak@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Imre Deak <imre.deak@intel.com>
[Jani: regression reference]
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
jwrdegoede pushed a commit that referenced this pull request Oct 13, 2015
Dmitry Vyukov reported the following using trinity and the memory
error detector AddressSanitizer
(https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel).

[ 124.575597] ERROR: AddressSanitizer: heap-buffer-overflow on address ffff88002e280000
[ 124.576801] ffff88002e280000 is located 131938492886538 bytes to the left of 28857600-byte region [ffffffff81282e0a, ffffffff82e0830a)
[ 124.578633] Accessed by thread T10915:
[ 124.579295] inlined in describe_heap_address ./arch/x86/mm/asan/report.c:164
[ 124.579295] #0 ffffffff810dd277 in asan_report_error ./arch/x86/mm/asan/report.c:278
[ 124.580137] #1 ffffffff810dc6a0 in asan_check_region ./arch/x86/mm/asan/asan.c:37
[ 124.581050] #2 ffffffff810dd423 in __tsan_read8 ??:0
[ 124.581893] #3 ffffffff8107c093 in get_wchan ./arch/x86/kernel/process_64.c:444

The address checks in the 64bit implementation of get_wchan() are
wrong in several ways:

 - The lower bound of the stack is not the start of the stack
   page. It's the start of the stack page plus sizeof (struct
   thread_info)

 - The upper bound must be:

       top_of_stack - TOP_OF_KERNEL_STACK_PADDING - 2 * sizeof(unsigned long).

   The 2 * sizeof(unsigned long) is required because the stack pointer
   points at the frame pointer. The layout on the stack is: ... IP FP
   ... IP FP. So we need to make sure that both IP and FP are in the
   bounds.

Fix the bound checks and get rid of the mix of numeric constants, u64
and unsigned long. Making all unsigned long allows us to use the same
function for 32bit as well.

Use READ_ONCE() when accessing the stack. This does not prevent a
concurrent wakeup of the task and the stack changing, but at least it
avoids TOCTOU.

Also check task state at the end of the loop. Again that does not
prevent concurrent changes, but it avoids walking for nothing.

Add proper comments while at it.
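
The two bounds described above can be sketched as a standalone check. This is
an illustration only, not the code from the patch: fp_in_bounds() and the
SKETCH_* constants are hypothetical, and their values are placeholders for
THREAD_SIZE, sizeof(struct thread_info) and TOP_OF_KERNEL_STACK_PADDING,
which in the kernel depend on architecture and configuration.

```c
#include <stdbool.h>

/* Placeholder values, chosen only so the sketch compiles in userspace. */
#define SKETCH_THREAD_SIZE	(16UL * 1024)	/* stand-in for THREAD_SIZE */
#define SKETCH_THREAD_INFO_SIZE	64UL	/* stand-in for sizeof(struct thread_info) */
#define SKETCH_TOP_PADDING	0UL	/* stand-in for TOP_OF_KERNEL_STACK_PADDING */

/*
 * Return true if a frame at fp fits on the stack starting at stack_start.
 * The saved FP lives at fp and the saved IP right above it, so the
 * highest valid fp leaves room for two unsigned longs below the top.
 */
static bool fp_in_bounds(unsigned long stack_start, unsigned long fp)
{
	unsigned long bottom = stack_start + SKETCH_THREAD_INFO_SIZE;
	unsigned long top = stack_start + SKETCH_THREAD_SIZE
			    - SKETCH_TOP_PADDING
			    - 2 * sizeof(unsigned long);

	return fp >= bottom && fp <= top;
}
```

Using unsigned long for every quantity, as the changelog suggests, keeps the
comparison free of mixed-width arithmetic on both 32-bit and 64-bit builds.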

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Based-on-patch-from: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: kasan-dev <kasan-dev@googlegroups.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930083302.694788319@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
jwrdegoede pushed a commit that referenced this pull request Oct 13, 2015
Commit 1a3d595 ("MIPS: Tidy up FPU context switching") removed FP
context saving from the asm-written resume function in favour of reusing
existing code to perform the same task. However it only removed the FP
context saving code from the r4k_switch.S implementation of resume.
Octeon uses its own implementation in octeon_switch.S, so remove FP
context saving there too in order to prevent attempting to save context
twice. That formerly led to an exception from the second save as follows
because the FPU had already been disabled by the first save:

    do_cpu invoked from kernel context![#1]:
    CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.3.0-rc2-dirty #2
    task: 800000041f84a008 ti: 800000041f864000 task.ti: 800000041f864000
    $ 0   : 0000000000000000 0000000010008ce1 0000000000100000 ffffffffbfffffff
    $ 4   : 800000041f84a008 800000041f84ac08 800000041f84c000 0000000000000004
    $ 8   : 0000000000000001 0000000000000000 0000000000000000 0000000000000001
    $12   : 0000000010008ce3 0000000000119c60 0000000000000036 800000041f864000
    $16   : 800000041f84ac08 800000000792ce80 800000041f84a008 ffffffff81758b00
    $20   : 0000000000000000 ffffffff8175ae50 0000000000000000 ffffffff8176c740
    $24   : 0000000000000006 ffffffff81170300
    $28   : 800000041f864000 800000041f867d90 0000000000000000 ffffffff815f3fa0
    Hi    : 0000000000fa8257
    Lo    : ffffffffe15cfc00
    epc   : ffffffff8112821c resume+0x9c/0x200
    ra    : ffffffff815f3fa0 __schedule+0x3f0/0x7d8
    Status: 10008ce2        KX SX UX KERNEL EXL
    Cause : 1080002c (ExcCode 0b)
    PrId  : 000d0601 (Cavium Octeon+)
    Modules linked in:
    Process kthreadd (pid: 2, threadinfo=800000041f864000, task=800000041f84a008, tls=0000000000000000)
    Stack : ffffffff81604218 ffffffff815f7e08 800000041f84a008 ffffffff811681b0
              800000041f84a008 ffffffff817e9878 0000000000000000 ffffffff81770000
              ffffffff81768340 ffffffff81161398 0000000000000001 0000000000000000
              0000000000000000 ffffffff815f4424 0000000000000000 ffffffff81161d68
              ffffffff81161be8 0000000000000000 0000000000000000 0000000000000000
              0000000000000000 0000000000000000 0000000000000000 ffffffff8111e16c
              0000000000000000 0000000000000000 0000000000000000 0000000000000000
              0000000000000000 0000000000000000 0000000000000000 0000000000000000
              0000000000000000 0000000000000000 0000000000000000 0000000000000000
              0000000000000000 0000000000000000 0000000000000000 0000000000000000
              ...
    Call Trace:
    [<ffffffff8112821c>] resume+0x9c/0x200
    [<ffffffff815f3fa0>] __schedule+0x3f0/0x7d8
    [<ffffffff815f4424>] schedule+0x34/0x98
    [<ffffffff81161d68>] kthreadd+0x180/0x198
    [<ffffffff8111e16c>] ret_from_kernel_thread+0x14/0x1c

Tested using cavium_octeon_defconfig on an EdgeRouter Lite.

Fixes: 1a3d595 ("MIPS: Tidy up FPU context switching")
Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Aleksey Makarov <aleksey.makarov@auriga.com>
Cc: linux-kernel@vger.kernel.org
Cc: Chandrakala Chavva <cchavva@caviumnetworks.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Leonid Rosenboim <lrosenboim@caviumnetworks.com>
Patchwork: https://patchwork.linux-mips.org/patch/11166/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
jwrdegoede pushed a commit that referenced this pull request Oct 13, 2015
My colleague ran into a program stall on an x86_64 server, where
n_tty_read() was waiting for data even though there was data in the
pty buffer.  The kernel stack for the stuck process looks like this:
 #0 [ffff88303d107b58] __schedule at ffffffff815c4b20
 #1 [ffff88303d107bd0] schedule at ffffffff815c513e
 #2 [ffff88303d107bf0] schedule_timeout at ffffffff815c7818
 #3 [ffff88303d107ca0] wait_woken at ffffffff81096bd2
 #4 [ffff88303d107ce0] n_tty_read at ffffffff8136fa23
 #5 [ffff88303d107dd0] tty_read at ffffffff81368013
 #6 [ffff88303d107e20] __vfs_read at ffffffff811a3704
 #7 [ffff88303d107ec0] vfs_read at ffffffff811a3a57
 #8 [ffff88303d107f00] sys_read at ffffffff811a4306
 #9 [ffff88303d107f50] entry_SYSCALL_64_fastpath at ffffffff815c86d7

There seem to be two problems causing this issue.

First, in drivers/tty/n_tty.c, __receive_buf() stores the data and
updates ldata->commit_head using smp_store_release() and then checks
the wait queue using waitqueue_active().  However, since there is no
memory barrier, __receive_buf() could return without calling
wake_up_interactive_poll(), and at the same time, n_tty_read() could
start to wait in wait_woken() as in the following chart.

        __receive_buf()                         n_tty_read()
------------------------------------------------------------------------
if (waitqueue_active(&tty->read_wait))
/* Memory operations issued after the
   RELEASE may be completed before the
   RELEASE operation has completed */
                                        add_wait_queue(&tty->read_wait, &wait);
                                        ...
                                        if (!input_available_p(tty, 0)) {
smp_store_release(&ldata->commit_head,
                  ldata->read_head);
                                        ...
                                        timeout = wait_woken(&wait,
                                          TASK_INTERRUPTIBLE, timeout);
------------------------------------------------------------------------

The second problem is that n_tty_read() also lacks a memory barrier
call and could also cause __receive_buf() to return without calling
wake_up_interactive_poll(), and n_tty_read() to wait in wait_woken()
as in the chart below.

        __receive_buf()                         n_tty_read()
------------------------------------------------------------------------
                                        spin_lock_irqsave(&q->lock, flags);
                                        /* from add_wait_queue() */
                                        ...
                                        if (!input_available_p(tty, 0)) {
                                        /* Memory operations issued after the
                                           RELEASE may be completed before the
                                           RELEASE operation has completed */
smp_store_release(&ldata->commit_head,
                  ldata->read_head);
if (waitqueue_active(&tty->read_wait))
                                        __add_wait_queue(q, wait);
                                        spin_unlock_irqrestore(&q->lock,flags);
                                        /* from add_wait_queue() */
                                        ...
                                        timeout = wait_woken(&wait,
                                          TASK_INTERRUPTIBLE, timeout);
------------------------------------------------------------------------

There are also other places in drivers/tty/n_tty.c which have similar
calls to waitqueue_active(), so instead of adding many memory barrier
calls, this patch simply removes the call to waitqueue_active(),
leaving just wake_up*() behind.

This fixes both problems because, even though the memory access before
or after the spinlocks in both wake_up*() and add_wait_queue() can
sneak into the critical section, it cannot go past it and the critical
section assures that they will be serialized (please see "INTER-CPU
ACQUIRING BARRIER EFFECTS" in Documentation/memory-barriers.txt for a
better explanation).  Moreover, the resulting code is much simpler.

Latency measurement using a ping-pong test over a pty doesn't show any
visible performance drop.
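
The idea of waking unconditionally and letting a shared lock serialize the
waker against the waiter can be illustrated with a userspace analogue. This
is a sketch under stated assumptions, not the n_tty.c change: struct chan,
chan_init(), chan_publish() and chan_consume() are hypothetical names, and a
pthread mutex plus condition variable stand in for the wait-queue spinlock
taken inside wake_up*() and add_wait_queue(). Unlike the kernel code, the
data store here also happens under the lock, which keeps the sketch simple.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical one-slot channel; none of these names come from n_tty.c. */
struct chan {
	pthread_mutex_t lock;
	pthread_cond_t readable;
	bool has_data;
	int data;
};

static void chan_init(struct chan *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->readable, NULL);
	c->has_data = false;
	c->data = 0;
}

/*
 * Writer side: publish, then wake unconditionally. There is no
 * "is anybody waiting?" shortcut that could race with a reader
 * that is just about to go to sleep.
 */
static void chan_publish(struct chan *c, int value)
{
	pthread_mutex_lock(&c->lock);
	c->data = value;
	c->has_data = true;
	pthread_cond_signal(&c->readable);
	pthread_mutex_unlock(&c->lock);
}

/* Reader side: check the condition under the same lock before sleeping. */
static int chan_consume(struct chan *c)
{
	int value;

	pthread_mutex_lock(&c->lock);
	while (!c->has_data)
		pthread_cond_wait(&c->readable, &c->lock);
	value = c->data;
	c->has_data = false;
	pthread_mutex_unlock(&c->lock);
	return value;
}
```

The cost is one lock acquisition per publish even when nobody is waiting,
which matches the trade-off the changelog describes as having no visible
performance impact in its ping-pong test.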

Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
jwrdegoede pushed a commit that referenced this pull request Oct 13, 2015
Running the kprobe test on an arm64 rt kernel reports the following warning:

root@qemu7:~# modprobe kprobe_example
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
in_atomic(): 0, irqs_disabled(): 128, pid: 484, name: modprobe
CPU: 0 PID: 484 Comm: modprobe Not tainted 4.1.6-rt5 #2
Hardware name: linux,dummy-virt (DT)
Call trace:
[<ffffffc0000891b8>] dump_backtrace+0x0/0x128
[<ffffffc000089300>] show_stack+0x20/0x30
[<ffffffc00061dae8>] dump_stack+0x1c/0x28
[<ffffffc0000bbad0>] ___might_sleep+0x120/0x198
[<ffffffc0006223e8>] rt_spin_lock+0x28/0x40
[<ffffffc000622b30>] __aarch64_insn_write+0x28/0x78
[<ffffffc000622e48>] aarch64_insn_patch_text_nosync+0x18/0x48
[<ffffffc000622ee8>] aarch64_insn_patch_text_cb+0x70/0xa0
[<ffffffc000622f40>] aarch64_insn_patch_text_sync+0x28/0x48
[<ffffffc0006236e0>] arch_arm_kprobe+0x38/0x48
[<ffffffc00010e6f4>] arm_kprobe+0x34/0x50
[<ffffffc000110374>] register_kprobe+0x4cc/0x5b8
[<ffffffbffc002038>] kprobe_init+0x38/0x7c [kprobe_example]
[<ffffffc000084240>] do_one_initcall+0x90/0x1b0
[<ffffffc00061c498>] do_init_module+0x6c/0x1cc
[<ffffffc0000fd0c0>] load_module+0x17f8/0x1db0
[<ffffffc0000fd8cc>] SyS_finit_module+0xb4/0xc8

Convert patch_lock to a raw lock to avoid this issue.

Although the problem was found on the rt kernel, the fix should be applicable
to the mainline kernel too.

Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
jwrdegoede pushed a commit that referenced this pull request Sep 16, 2024
…git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following batch contains two fixes from Florian Westphal:

Patch #1 fixes a sk refcount leak in nft_socket on mismatch.

Patch #2 fixes cgroupsv2 matching from containers due to incorrect
	 level in subtree.

netfilter pull request 24-09-12

* tag 'nf-24-09-12' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_socket: make cgroupsv2 matching work with namespaces
  netfilter: nft_socket: fix sk refcount leaks
====================

Link: https://patch.msgid.link/20240911222520.3606-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
jwrdegoede pushed a commit that referenced this pull request Oct 15, 2024
On an NFS client node, some files stored under the mountpoint of an NFS
server were copied to another location on the same NFS server.
Unexpectedly, nfs42_complete_copies() hit a NULL pointer dereference and
crashed with the following syslog:

[232064.838881] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116
[232064.839360] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116
[232066.588183] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
[232066.588586] Mem abort info:
[232066.588701]   ESR = 0x0000000096000007
[232066.588862]   EC = 0x25: DABT (current EL), IL = 32 bits
[232066.589084]   SET = 0, FnV = 0
[232066.589216]   EA = 0, S1PTW = 0
[232066.589340]   FSC = 0x07: level 3 translation fault
[232066.589559] Data abort info:
[232066.589683]   ISV = 0, ISS = 0x00000007
[232066.589842]   CM = 0, WnR = 0
[232066.589967] user pgtable: 64k pages, 48-bit VAs, pgdp=00002000956ff400
[232066.590231] [0000000000000058] pgd=08001100ae100003, p4d=08001100ae100003, pud=08001100ae100003, pmd=08001100b3c00003, pte=0000000000000000
[232066.590757] Internal error: Oops: 96000007 [#1] SMP
[232066.590958] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm vhost_net vhost vhost_iotlb tap tun ipt_rpfilter xt_multiport ip_set_hash_ip ip_set_hash_net xfrm_interface xfrm6_tunnel tunnel4 tunnel6 esp4 ah4 wireguard libcurve25519_generic veth xt_addrtype xt_set nf_conntrack_netlink ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_filter sch_ingress nfnetlink_cttimeout vport_gre ip_gre ip_tunnel gre vport_geneve geneve vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conncount dm_round_robin dm_service_time dm_multipath xt_nat xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink ocfs2 ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_ssif nbd overlay 8021q garp mrp bonding tls rfkill sunrpc ext4 mbcache jbd2
[232066.591052]  vfat fat cas_cache cas_disk ses enclosure scsi_transport_sas sg acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ip_tables vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio dm_mirror dm_region_hash dm_log dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse xfs libcrc32c ast drm_vram_helper qla2xxx drm_kms_helper syscopyarea crct10dif_ce sysfillrect ghash_ce sysimgblt sha2_ce fb_sys_fops cec sha256_arm64 sha1_ce drm_ttm_helper ttm nvme_fc igb sbsa_gwdt nvme_fabrics drm nvme_core i2c_algo_bit i40e scsi_transport_fc megaraid_sas aes_neon_bs
[232066.596953] CPU: 6 PID: 4124696 Comm: 10.253.166.125- Kdump: loaded Not tainted 5.15.131-9.cl9_ocfs2.aarch64 #1
[232066.597356] Hardware name: Great Wall .\x93\x8e...RF6260 V5/GWMSSE2GL1T, BIOS T656FBE_V3.0.18 2024-01-06
[232066.597721] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[232066.598034] pc : nfs4_reclaim_open_state+0x220/0x800 [nfsv4]
[232066.598327] lr : nfs4_reclaim_open_state+0x12c/0x800 [nfsv4]
[232066.598595] sp : ffff8000f568fc70
[232066.598731] x29: ffff8000f568fc70 x28: 0000000000001000 x27: ffff21003db33000
[232066.599030] x26: ffff800005521ae0 x25: ffff0100f98fa3f0 x24: 0000000000000001
[232066.599319] x23: ffff800009920008 x22: ffff21003db33040 x21: ffff21003db33050
[232066.599628] x20: ffff410172fe9e40 x19: ffff410172fe9e00 x18: 0000000000000000
[232066.599914] x17: 0000000000000000 x16: 0000000000000004 x15: 0000000000000000
[232066.600195] x14: 0000000000000000 x13: ffff800008e685a8 x12: 00000000eac0c6e6
[232066.600498] x11: 0000000000000000 x10: 0000000000000008 x9 : ffff8000054e5828
[232066.600784] x8 : 00000000ffffffbf x7 : 0000000000000001 x6 : 000000000a9eb14a
[232066.601062] x5 : 0000000000000000 x4 : ffff70ff8a14a800 x3 : 0000000000000058
[232066.601348] x2 : 0000000000000001 x1 : 54dce46366daa6c6 x0 : 0000000000000000
[232066.601636] Call trace:
[232066.601749]  nfs4_reclaim_open_state+0x220/0x800 [nfsv4]
[232066.601998]  nfs4_do_reclaim+0x1b8/0x28c [nfsv4]
[232066.602218]  nfs4_state_manager+0x928/0x10f0 [nfsv4]
[232066.602455]  nfs4_run_state_manager+0x78/0x1b0 [nfsv4]
[232066.602690]  kthread+0x110/0x114
[232066.602830]  ret_from_fork+0x10/0x20
[232066.602985] Code: 1400000d f9403f20 f9402e61 91016003 (f9402c00)
[232066.603284] SMP: stopping secondary CPUs
[232066.606936] Starting crashdump kernel...
[232066.607146] Bye!

Analysing the vmcore shows that the nfs4_copy_state on the destination
nfs_server->ss_copies list was added through its copies field in
handle_async_copy(), and a copy process was found waiting with this stack:
PID: 3511963  TASK: ffff710028b47e00  CPU: 0   COMMAND: "cp"
 #0 [ffff8001116ef740] __switch_to at ffff8000081b92f4
 #1 [ffff8001116ef760] __schedule at ffff800008dd0650
 #2 [ffff8001116ef7c0] schedule at ffff800008dd0a00
 #3 [ffff8001116ef7e0] schedule_timeout at ffff800008dd6aa0
 #4 [ffff8001116ef860] __wait_for_common at ffff800008dd166c
 #5 [ffff8001116ef8e0] wait_for_completion_interruptible at ffff800008dd1898
 #6 [ffff8001116ef8f0] handle_async_copy at ffff8000055142f4 [nfsv4]
 #7 [ffff8001116ef970] _nfs42_proc_copy at ffff8000055147c8 [nfsv4]
 #8 [ffff8001116efa80] nfs42_proc_copy at ffff800005514cf0 [nfsv4]
 #9 [ffff8001116efc50] __nfs4_copy_file_range.constprop.0 at ffff8000054ed694 [nfsv4]

The NULL pointer dereference happened because nfs42_complete_copies()
walked nfs_server->ss_copies through the ss_copies field of
nfs4_copy_state, while the entry had been linked through the copies field.
The computed nfs4_copy_state address ffff0100f98fa3f0 was therefore off by
0x10, and the data read through that pointer was wrong. Generally, the
ordered list nfs4_state_owner->so_states means that open(O_RDWR) or
open(O_WRONLY) states are reclaimed first by nfs4_reclaim_open_state().
When reclaim of the destination state fails with NFS_STATE_RECOVERY_FAILED
and its copies are not deleted from nfs_server->ss_copies, the source state
may be passed to nfs42_complete_copies() earlier, which ends in this
crash. To solve this, add a separate list_head nfs_server->ss_src_copies
for server-to-server copies.
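
The 0x10 offset described above is what one would expect when a list node is resolved through a different list_head member than the one it was linked with. A minimal userspace sketch of that effect, assuming a simplified and hypothetical layout (struct copy_state and its fields are illustrative, not the kernel's nfs4_copy_state):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical object with two embedded list nodes. */
struct copy_state {
    struct list_head copies;     /* node actually linked into the list */
    struct list_head ss_copies;  /* node the walker wrongly assumes */
    int flags;
};

/*
 * Byte distance between the real object and the address a walker
 * computes when it treats a `copies` node as an `ss_copies` node.
 * Integer arithmetic is used so no out-of-bounds pointer is formed.
 */
static intptr_t mismatch_offset(const struct copy_state *real)
{
    assert(real != NULL);
    uintptr_t node = (uintptr_t)&real->copies;
    /* container_of(node, struct copy_state, ss_copies), done on integers */
    uintptr_t seen = node - offsetof(struct copy_state, ss_copies);
    return (intptr_t)((uintptr_t)real - seen);
}
```

In this sketch the result equals sizeof(struct list_head), which is 16 (0x10) bytes on a typical 64-bit build. Keeping a dedicated list per member, as the commit does, avoids mixing the two views.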

Fixes: 0e65a32 ("NFS: handle source server reboot")
Signed-off-by: Yanjun Zhang <zhangyanjun@cestc.cn>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
jwrdegoede pushed a commit that referenced this pull request Oct 15, 2024
Fix a kernel panic in the br_netfilter module when sending untagged
traffic via a VxLAN device.
This happens during the check for fragmentation in br_nf_dev_queue_xmit.

It is dependent on:
1) the br_netfilter module being loaded;
2) net.bridge.bridge-nf-call-iptables set to 1;
3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port;
4) untagged frames with size higher than the VxLAN MTU forwarded/flooded

When forwarding the untagged packet to the VxLAN bridge port, before
the netfilter hooks are called, br_handle_egress_vlan_tunnel is called and
changes the skb_dst to the tunnel dst. The tunnel_dst is a metadata type
of dst, i.e., skb_valid_dst(skb) is false, and metadata->dst.dev is NULL.

Then in the br_netfilter hooks, in br_nf_dev_queue_xmit, there's a check
for frames that need to be fragmented: frames larger than the MTU of the
VxLAN device end up calling br_nf_ip_fragment, which in turn calls
ip_skb_dst_mtu.

ip_dst_mtu tries to use skb_dst(skb) as if it were a valid dst
with a valid dst->dev, hence the crash.

This case was never supported in the first place, so drop the packet
instead.
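
The decision described above can be sketched in userspace C. This is only an illustration under stated assumptions: the types, the is_metadata flag, and the function names are hypothetical stand-ins, not the br_netfilter code or the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for kernel types. */
struct net_device { unsigned int mtu; };
struct dst_entry {
    bool is_metadata;        /* models a metadata (tunnel) dst */
    struct net_device *dev;  /* NULL for a metadata dst */
};
struct packet {
    struct dst_entry *dst;
    unsigned int len;
};

enum verdict { VERDICT_XMIT, VERDICT_FRAGMENT, VERDICT_DROP };

/* Models the idea of skb_valid_dst(): present and not metadata. */
static bool pkt_valid_dst(const struct packet *p)
{
    return p->dst != NULL && !p->dst->is_metadata;
}

/* Decide what to do with a frame leaving through a port with dev_mtu. */
static enum verdict queue_xmit_verdict(const struct packet *p,
                                       unsigned int dev_mtu)
{
    assert(p != NULL);
    if (p->len <= dev_mtu)
        return VERDICT_XMIT;
    /* Fragmenting would dereference dst->dev, so check validity first. */
    if (!pkt_valid_dst(p))
        return VERDICT_DROP;
    return VERDICT_FRAGMENT;
}
```

An oversized frame carrying a metadata dst yields VERDICT_DROP here, while the same frame with a regular dst would be fragmented.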

PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data.
[  176.291791] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000110
[  176.292101] Mem abort info:
[  176.292184]   ESR = 0x0000000096000004
[  176.292322]   EC = 0x25: DABT (current EL), IL = 32 bits
[  176.292530]   SET = 0, FnV = 0
[  176.292709]   EA = 0, S1PTW = 0
[  176.292862]   FSC = 0x04: level 0 translation fault
[  176.293013] Data abort info:
[  176.293104]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  176.293488]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  176.293787]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000
[  176.294166] [0000000000000110] pgd=0000000000000000, p4d=0000000000000000
[  176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth br_netfilter bridge stp llc ipv6 crct10dif_ce
[  176.295923] CPU: 0 PID: 188 Comm: ping Not tainted 6.8.0-rc3-g5b3fbd61b9d1 #2
[  176.296314] Hardware name: linux,dummy-virt (DT)
[  176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[  176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter]
[  176.297636] sp : ffff800080003630
[  176.297743] x29: ffff800080003630 x28: 0000000000000008 x27: ffff6828c49ad9f8
[  176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24: 00000000000003e8
[  176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21: ffff6828c3b16d28
[  176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18: 0000000000000014
[  176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15: 0000000095744632
[  176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12: ffffb7e137926a70
[  176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 : 0000000000000000
[  176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 : f20e0100bebafeca
[  176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 : 0000000000000000
[  176.300586] x2 : 0000000000000000 x1 : ffff6828c3c7ad00 x0 : ffff6828c7f918f0
[  176.300889] Call trace:
[  176.301123]  br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[  176.301411]  br_nf_post_routing+0x2a8/0x3e4 [br_netfilter]
[  176.301703]  nf_hook_slow+0x48/0x124
[  176.302060]  br_forward_finish+0xc8/0xe8 [bridge]
[  176.302371]  br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[  176.302605]  br_nf_forward_finish+0x118/0x22c [br_netfilter]
[  176.302824]  br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter]
[  176.303136]  br_nf_forward+0x2b8/0x4e0 [br_netfilter]
[  176.303359]  nf_hook_slow+0x48/0x124
[  176.303803]  __br_forward+0xc4/0x194 [bridge]
[  176.304013]  br_flood+0xd4/0x168 [bridge]
[  176.304300]  br_handle_frame_finish+0x1d4/0x5c4 [bridge]
[  176.304536]  br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[  176.304978]  br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter]
[  176.305188]  br_nf_pre_routing+0x250/0x524 [br_netfilter]
[  176.305428]  br_handle_frame+0x244/0x3cc [bridge]
[  176.305695]  __netif_receive_skb_core.constprop.0+0x33c/0xecc
[  176.306080]  __netif_receive_skb_one_core+0x40/0x8c
[  176.306197]  __netif_receive_skb+0x18/0x64
[  176.306369]  process_backlog+0x80/0x124
[  176.306540]  __napi_poll+0x38/0x17c
[  176.306636]  net_rx_action+0x124/0x26c
[  176.306758]  __do_softirq+0x100/0x26c
[  176.307051]  ____do_softirq+0x10/0x1c
[  176.307162]  call_on_irq_stack+0x24/0x4c
[  176.307289]  do_softirq_own_stack+0x1c/0x2c
[  176.307396]  do_softirq+0x54/0x6c
[  176.307485]  __local_bh_enable_ip+0x8c/0x98
[  176.307637]  __dev_queue_xmit+0x22c/0xd28
[  176.307775]  neigh_resolve_output+0xf4/0x1a0
[  176.308018]  ip_finish_output2+0x1c8/0x628
[  176.308137]  ip_do_fragment+0x5b4/0x658
[  176.308279]  ip_fragment.constprop.0+0x48/0xec
[  176.308420]  __ip_finish_output+0xa4/0x254
[  176.308593]  ip_finish_output+0x34/0x130
[  176.308814]  ip_output+0x6c/0x108
[  176.308929]  ip_send_skb+0x50/0xf0
[  176.309095]  ip_push_pending_frames+0x30/0x54
[  176.309254]  raw_sendmsg+0x758/0xaec
[  176.309568]  inet_sendmsg+0x44/0x70
[  176.309667]  __sys_sendto+0x110/0x178
[  176.309758]  __arm64_sys_sendto+0x28/0x38
[  176.309918]  invoke_syscall+0x48/0x110
[  176.310211]  el0_svc_common.constprop.0+0x40/0xe0
[  176.310353]  do_el0_svc+0x1c/0x28
[  176.310434]  el0_svc+0x34/0xb4
[  176.310551]  el0t_64_sync_handler+0x120/0x12c
[  176.310690]  el0t_64_sync+0x190/0x194
[  176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860)
[  176.315743] ---[ end trace 0000000000000000 ]---
[  176.316060] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000
[  176.316564] PHYS_OFFSET: 0xffff97d780000000
[  176.316782] CPU features: 0x0,88000203,3c020000,0100421b
[  176.317210] Memory Limit: none
[  176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal Exception in interrupt ]---

Fixes: 11538d0 ("bridge: vlan dst_metadata hooks in ingress and egress paths")
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Andy Roulin <aroulin@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241001154400.22787-2-aroulin@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Oct 15, 2024
Andy Roulin says:

====================
netfilter: br_netfilter: fix panic with metadata_dst skb

There's a kernel panic possible in the br_netfilter module when sending
untagged traffic via a VxLAN device. Traceback is included below.
This happens during the check for fragmentation in br_nf_dev_queue_xmit
if the MTU on the VxLAN device is not big enough.

It is dependent on:
1) the br_netfilter module being loaded;
2) net.bridge.bridge-nf-call-iptables set to 1;
3) a bridge with a VxLAN (single-vxlan-device) netdevice as a bridge port;
4) untagged frames with size higher than the VxLAN MTU forwarded/flooded

This case was never supported in the first place, so the first patch drops
such packets.

A regression selftest is added as part of the second patch.

PING 10.0.0.2 (10.0.0.2) from 0.0.0.0 h1-eth0: 2000(2028) bytes of data.
[  176.291791] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000110
[  176.292101] Mem abort info:
[  176.292184]   ESR = 0x0000000096000004
[  176.292322]   EC = 0x25: DABT (current EL), IL = 32 bits
[  176.292530]   SET = 0, FnV = 0
[  176.292709]   EA = 0, S1PTW = 0
[  176.292862]   FSC = 0x04: level 0 translation fault
[  176.293013] Data abort info:
[  176.293104]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  176.293488]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  176.293787]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  176.293995] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000043ef5000
[  176.294166] [0000000000000110] pgd=0000000000000000, p4d=0000000000000000
[  176.294827] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  176.295252] Modules linked in: vxlan ip6_udp_tunnel udp_tunnel veth br_netfilter bridge stp llc ipv6 crct10dif_ce
[  176.295923] CPU: 0 PID: 188 Comm: ping Not tainted 6.8.0-rc3-g5b3fbd61b9d1 #2
[  176.296314] Hardware name: linux,dummy-virt (DT)
[  176.296535] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  176.296808] pc : br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[  176.297382] lr : br_nf_dev_queue_xmit+0x2ac/0x4ec [br_netfilter]
[  176.297636] sp : ffff800080003630
[  176.297743] x29: ffff800080003630 x28: 0000000000000008 x27: ffff6828c49ad9f8
[  176.298093] x26: ffff6828c49ad000 x25: 0000000000000000 x24: 00000000000003e8
[  176.298430] x23: 0000000000000000 x22: ffff6828c4960b40 x21: ffff6828c3b16d28
[  176.298652] x20: ffff6828c3167048 x19: ffff6828c3b16d00 x18: 0000000000000014
[  176.298926] x17: ffffb0476322f000 x16: ffffb7e164023730 x15: 0000000095744632
[  176.299296] x14: ffff6828c3f1c880 x13: 0000000000000002 x12: ffffb7e137926a70
[  176.299574] x11: 0000000000000001 x10: ffff6828c3f1c898 x9 : 0000000000000000
[  176.300049] x8 : ffff6828c49bf070 x7 : 0008460f18d5f20e x6 : f20e0100bebafeca
[  176.300302] x5 : ffff6828c7f918fe x4 : ffff6828c49bf070 x3 : 0000000000000000
[  176.300586] x2 : 0000000000000000 x1 : ffff6828c3c7ad00 x0 : ffff6828c7f918f0
[  176.300889] Call trace:
[  176.301123]  br_nf_dev_queue_xmit+0x390/0x4ec [br_netfilter]
[  176.301411]  br_nf_post_routing+0x2a8/0x3e4 [br_netfilter]
[  176.301703]  nf_hook_slow+0x48/0x124
[  176.302060]  br_forward_finish+0xc8/0xe8 [bridge]
[  176.302371]  br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[  176.302605]  br_nf_forward_finish+0x118/0x22c [br_netfilter]
[  176.302824]  br_nf_forward_ip.part.0+0x264/0x290 [br_netfilter]
[  176.303136]  br_nf_forward+0x2b8/0x4e0 [br_netfilter]
[  176.303359]  nf_hook_slow+0x48/0x124
[  176.303803]  __br_forward+0xc4/0x194 [bridge]
[  176.304013]  br_flood+0xd4/0x168 [bridge]
[  176.304300]  br_handle_frame_finish+0x1d4/0x5c4 [bridge]
[  176.304536]  br_nf_hook_thresh+0x124/0x134 [br_netfilter]
[  176.304978]  br_nf_pre_routing_finish+0x29c/0x494 [br_netfilter]
[  176.305188]  br_nf_pre_routing+0x250/0x524 [br_netfilter]
[  176.305428]  br_handle_frame+0x244/0x3cc [bridge]
[  176.305695]  __netif_receive_skb_core.constprop.0+0x33c/0xecc
[  176.306080]  __netif_receive_skb_one_core+0x40/0x8c
[  176.306197]  __netif_receive_skb+0x18/0x64
[  176.306369]  process_backlog+0x80/0x124
[  176.306540]  __napi_poll+0x38/0x17c
[  176.306636]  net_rx_action+0x124/0x26c
[  176.306758]  __do_softirq+0x100/0x26c
[  176.307051]  ____do_softirq+0x10/0x1c
[  176.307162]  call_on_irq_stack+0x24/0x4c
[  176.307289]  do_softirq_own_stack+0x1c/0x2c
[  176.307396]  do_softirq+0x54/0x6c
[  176.307485]  __local_bh_enable_ip+0x8c/0x98
[  176.307637]  __dev_queue_xmit+0x22c/0xd28
[  176.307775]  neigh_resolve_output+0xf4/0x1a0
[  176.308018]  ip_finish_output2+0x1c8/0x628
[  176.308137]  ip_do_fragment+0x5b4/0x658
[  176.308279]  ip_fragment.constprop.0+0x48/0xec
[  176.308420]  __ip_finish_output+0xa4/0x254
[  176.308593]  ip_finish_output+0x34/0x130
[  176.308814]  ip_output+0x6c/0x108
[  176.308929]  ip_send_skb+0x50/0xf0
[  176.309095]  ip_push_pending_frames+0x30/0x54
[  176.309254]  raw_sendmsg+0x758/0xaec
[  176.309568]  inet_sendmsg+0x44/0x70
[  176.309667]  __sys_sendto+0x110/0x178
[  176.309758]  __arm64_sys_sendto+0x28/0x38
[  176.309918]  invoke_syscall+0x48/0x110
[  176.310211]  el0_svc_common.constprop.0+0x40/0xe0
[  176.310353]  do_el0_svc+0x1c/0x28
[  176.310434]  el0_svc+0x34/0xb4
[  176.310551]  el0t_64_sync_handler+0x120/0x12c
[  176.310690]  el0t_64_sync+0x190/0x194
[  176.311066] Code: f9402e61 79402aa2 927ff821 f9400023 (f9408860)
[  176.315743] ---[ end trace 0000000000000000 ]---
[  176.316060] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  176.316371] Kernel Offset: 0x37e0e3000000 from 0xffff800080000000
[  176.316564] PHYS_OFFSET: 0xffff97d780000000
[  176.316782] CPU features: 0x0,88000203,3c020000,0100421b
[  176.317210] Memory Limit: none
[  176.317527] ---[ end Kernel panic - not syncing: Oops: Fatal Exception in interrupt ]---
====================

Link: https://patch.msgid.link/20241001154400.22787-1-aroulin@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
Hou Tao says:

====================
Check the remaining info_cnt before repeating btf fields

From: Hou Tao <houtao1@huawei.com>

Hi,

The patch set adds the missing check against info_cnt when flattening an
array of nested structs. The problem was spotted when developing dynptr
key support for hash map. Patch #1 adds the missing check and patch #2
adds three success test cases and one failure test case for the problem.

Comments are always welcome.

Change Log:
v2:
 * patch #1: check info_cnt in btf_repeat_fields()
 * patch #2: use a hard-coded number instead of BTF_FIELDS_MAX, because
             BTF_FIELDS_MAX is not always available in vmlinux.h (e.g.,
	     for llvm 17/18)

v1: https://lore.kernel.org/bpf/20240911110557.2759801-1-houtao@huaweicloud.com/T/#t
====================
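
The capacity check that the cover letter above refers to can be illustrated with a small userspace sketch. It is not the kernel's btf_repeat_fields(); struct field_info, the parameter names, and the choice of -E2BIG as the error value are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical field record, reduced to the offset only. */
struct field_info {
    unsigned int off;
};

/*
 * info[0..field_cnt) describes the first array element. Append copies
 * for repeat_cnt further elements, each shifted by elem_size bytes.
 * info_cnt is the total capacity of info[].
 */
static int repeat_fields(struct field_info *info, size_t info_cnt,
                         size_t field_cnt, size_t repeat_cnt,
                         unsigned int elem_size)
{
    assert(info != NULL);
    if (field_cnt == 0)
        return 0;
    /* Refuse before writing past the end of info[]. */
    if (field_cnt > info_cnt ||
        repeat_cnt > (info_cnt - field_cnt) / field_cnt)
        return -E2BIG;

    size_t cur = field_cnt;
    for (size_t i = 0; i < repeat_cnt; i++) {
        for (size_t j = 0; j < field_cnt; j++, cur++) {
            info[cur] = info[j];
            info[cur].off += (unsigned int)(i + 1) * elem_size;
        }
    }
    return 0;
}
```

The check is written with a division so that the product of field_cnt and the repeat count cannot overflow before it is compared with the capacity.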

Link: https://lore.kernel.org/r/20241008071114.3718177-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
Syzkaller reported a lockdep splat:

  ============================================
  WARNING: possible recursive locking detected
  6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted
  --------------------------------------------
  syz-executor364/5113 is trying to acquire lock:
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  but task is already holding lock:
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
  ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(k-slock-AF_INET);
    lock(k-slock-AF_INET);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  7 locks held by syz-executor364/5113:
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline]
   #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 net/ipv4/ip_output.c:228
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
   #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
   #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline]
   #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328

  stack backtrace:
  CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  Call Trace:
   <IRQ>
   __dump_stack lib/dump_stack.c:93 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
   check_deadlock kernel/locking/lockdep.c:3061 [inline]
   validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855
   __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759
   __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
   _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154
   spin_lock include/linux/spinlock.h:351 [inline]
   sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328
   mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279
   subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874
   tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853
   tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267
   ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205
   ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314
   __netif_receive_skb_one_core net/core/dev.c:5661 [inline]
   __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775
   process_backlog+0x662/0x15b0 net/core/dev.c:6108
   __napi_poll+0xcb/0x490 net/core/dev.c:6772
   napi_poll net/core/dev.c:6841 [inline]
   net_rx_action+0x89b/0x1240 net/core/dev.c:6963
   handle_softirqs+0x2c4/0x970 kernel/softirq.c:554
   do_softirq+0x11b/0x1e0 kernel/softirq.c:455
   </IRQ>
   <TASK>
   __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382
   local_bh_enable include/linux/bottom_half.h:33 [inline]
   rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline]
   __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450
   dev_queue_xmit include/linux/netdevice.h:3105 [inline]
   neigh_hh_output include/net/neighbour.h:526 [inline]
   neigh_output include/net/neighbour.h:540 [inline]
   ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235
   ip_local_out net/ipv4/ip_output.c:129 [inline]
   __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535
   __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466
   tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline]
   tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729
   tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934
   sk_backlog_rcv include/net/sock.h:1111 [inline]
   __release_sock+0x214/0x350 net/core/sock.c:3004
   release_sock+0x61/0x1f0 net/core/sock.c:3558
   mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733
   mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812
   sock_sendmsg_nosec net/socket.c:730 [inline]
   __sock_sendmsg+0x1a6/0x270 net/socket.c:745
   ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597
   ___sys_sendmsg net/socket.c:2651 [inline]
   __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737
   __do_sys_sendmmsg net/socket.c:2766 [inline]
   __se_sys_sendmmsg net/socket.c:2763 [inline]
   __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f04fb13a6b9
  Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9
  RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004
  RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240
  R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300
   </TASK>

As noted by Cong Wang, the splat is a false positive, but the code
path leading to the report is an unexpected one: a client is
attempting an MPC handshake towards the in-kernel listener created
by the in-kernel PM for a port-based signal endpoint.

Such a connection will never be accepted; many of them can fill the
listener queue and prevent the creation of MPJ subflows via
such a listener - its intended role.

Explicitly detect this scenario at initial-syn time and drop the
incoming MPC request.

Fixes: 1729cf1 ("mptcp: create the listening socket for new port")
Cc: stable@vger.kernel.org
Reported-by: syzbot+f4aacdfef2c6a6529c3e@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241014-net-mptcp-mpc-port-endp-v2-1-7faea8e6b6ae@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.12, take #2

- Fix the guest view of the ID registers, making the relevant fields
  writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1)

- Correctly expose S1PIE to guests, fixing a regression introduced
  in 6.12-rc1 with the S1POE support

- Fix the recycling of stage-2 shadow MMUs by tracking the context
  (are we allowed to block or not) as well as the recycling state

- Address a couple of issues with the vgic when userspace misconfigures
  the emulation, resulting in various splats. Headaches courtesy
  of our Syzkaller friends
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
Currently, when configuring TMU (Time Management Unit) mode of a given
router, we take into account only its own TMU requirements ignoring
other routers in the domain. This is problematic if the router we are
configuring has lower TMU requirements than what is already configured
in the domain.

In the scenario below, we have a host router with two USB4 ports: A and
B. Port A is connected to device router #1 (which supports CL states) and
has an existing DisplayPort tunnel, thus the TMU mode is HiFi uni-directional.

1. Initial topology

          [Host]
         A/
         /
 [Device #1]
   /
Monitor

2. Plug in device #2 (that supports CL states) to downstream port B of
   the host router

         [Host]
        A/    B\
        /       \
 [Device #1]    [Device #2]
   /
Monitor

The TMU mode on port B and port A will be configured to LowRes, which is
not what we want and will cause the monitor to start flickering.

To address this we first scan the domain and search for any router
configured to HiFi uni-directional mode, and if found, configure TMU
mode of the given router to HiFi uni-directional as well.
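
The domain scan described above could look roughly like the following sketch. The router structure, the enum values, and both function names are hypothetical and much simpler than the thunderbolt driver's own data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tmu_mode { TMU_OFF, TMU_LOWRES, TMU_HIFI_UNI, TMU_HIFI_BI };

#define MAX_PORTS 4

/* Hypothetical router node in the domain tree. */
struct router {
    enum tmu_mode configured;
    struct router *down[MAX_PORTS];  /* downstream routers, NULL if none */
};

/* Depth-first search for any router already configured to `mode`. */
static bool domain_has_mode(const struct router *r, enum tmu_mode mode)
{
    if (r == NULL)
        return false;
    if (r->configured == mode)
        return true;
    for (size_t i = 0; i < MAX_PORTS; i++)
        if (domain_has_mode(r->down[i], mode))
            return true;
    return false;
}

/* Pick the mode for a router whose own requirement is `wanted`. */
static enum tmu_mode pick_tmu_mode(const struct router *host,
                                   enum tmu_mode wanted)
{
    assert(host != NULL);
    /* An existing HiFi uni-directional user wins over a lower request. */
    if (domain_has_mode(host, TMU_HIFI_UNI))
        return TMU_HIFI_UNI;
    return wanted;
}
```

With the topology from the scenario, a LowRes request for device #2 would resolve to HiFi uni-directional because device #1 already runs in that mode.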

Cc: stable@vger.kernel.org
Signed-off-by: Gil Fine <gil.fine@linux.intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
Fix possible use-after-free in 'taprio_dump()' by adding RCU
read-side critical section there. Never seen on x86 but
found on a KASAN-enabled arm64 system when investigating
https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa:

[T15862] BUG: KASAN: slab-use-after-free in taprio_dump+0xa0c/0xbb0
[T15862] Read of size 4 at addr ffff0000d4bb88f8 by task repro/15862
[T15862]
[T15862] CPU: 0 UID: 0 PID: 15862 Comm: repro Not tainted 6.11.0-rc1-00293-gdefaf1a2113a-dirty #2
[T15862] Hardware name: QEMU QEMU Virtual Machine, BIOS edk2-20240524-5.fc40 05/24/2024
[T15862] Call trace:
[T15862]  dump_backtrace+0x20c/0x220
[T15862]  show_stack+0x2c/0x40
[T15862]  dump_stack_lvl+0xf8/0x174
[T15862]  print_report+0x170/0x4d8
[T15862]  kasan_report+0xb8/0x1d4
[T15862]  __asan_report_load4_noabort+0x20/0x2c
[T15862]  taprio_dump+0xa0c/0xbb0
[T15862]  tc_fill_qdisc+0x540/0x1020
[T15862]  qdisc_notify.isra.0+0x330/0x3a0
[T15862]  tc_modify_qdisc+0x7b8/0x1838
[T15862]  rtnetlink_rcv_msg+0x3c8/0xc20
[T15862]  netlink_rcv_skb+0x1f8/0x3d4
[T15862]  rtnetlink_rcv+0x28/0x40
[T15862]  netlink_unicast+0x51c/0x790
[T15862]  netlink_sendmsg+0x79c/0xc20
[T15862]  __sock_sendmsg+0xe0/0x1a0
[T15862]  ____sys_sendmsg+0x6c0/0x840
[T15862]  ___sys_sendmsg+0x1ac/0x1f0
[T15862]  __sys_sendmsg+0x110/0x1d0
[T15862]  __arm64_sys_sendmsg+0x74/0xb0
[T15862]  invoke_syscall+0x88/0x2e0
[T15862]  el0_svc_common.constprop.0+0xe4/0x2a0
[T15862]  do_el0_svc+0x44/0x60
[T15862]  el0_svc+0x50/0x184
[T15862]  el0t_64_sync_handler+0x120/0x12c
[T15862]  el0t_64_sync+0x190/0x194
[T15862]
[T15862] Allocated by task 15857:
[T15862]  kasan_save_stack+0x3c/0x70
[T15862]  kasan_save_track+0x20/0x3c
[T15862]  kasan_save_alloc_info+0x40/0x60
[T15862]  __kasan_kmalloc+0xd4/0xe0
[T15862]  __kmalloc_cache_noprof+0x194/0x334
[T15862]  taprio_change+0x45c/0x2fe0
[T15862]  tc_modify_qdisc+0x6a8/0x1838
[T15862]  rtnetlink_rcv_msg+0x3c8/0xc20
[T15862]  netlink_rcv_skb+0x1f8/0x3d4
[T15862]  rtnetlink_rcv+0x28/0x40
[T15862]  netlink_unicast+0x51c/0x790
[T15862]  netlink_sendmsg+0x79c/0xc20
[T15862]  __sock_sendmsg+0xe0/0x1a0
[T15862]  ____sys_sendmsg+0x6c0/0x840
[T15862]  ___sys_sendmsg+0x1ac/0x1f0
[T15862]  __sys_sendmsg+0x110/0x1d0
[T15862]  __arm64_sys_sendmsg+0x74/0xb0
[T15862]  invoke_syscall+0x88/0x2e0
[T15862]  el0_svc_common.constprop.0+0xe4/0x2a0
[T15862]  do_el0_svc+0x44/0x60
[T15862]  el0_svc+0x50/0x184
[T15862]  el0t_64_sync_handler+0x120/0x12c
[T15862]  el0t_64_sync+0x190/0x194
[T15862]
[T15862] Freed by task 6192:
[T15862]  kasan_save_stack+0x3c/0x70
[T15862]  kasan_save_track+0x20/0x3c
[T15862]  kasan_save_free_info+0x4c/0x80
[T15862]  poison_slab_object+0x110/0x160
[T15862]  __kasan_slab_free+0x3c/0x74
[T15862]  kfree+0x134/0x3c0
[T15862]  taprio_free_sched_cb+0x18c/0x220
[T15862]  rcu_core+0x920/0x1b7c
[T15862]  rcu_core_si+0x10/0x1c
[T15862]  handle_softirqs+0x2e8/0xd64
[T15862]  __do_softirq+0x14/0x20

Fixes: 18cdd2f ("net/sched: taprio: taprio_dump and taprio_change are protected by rtnl_mutex")
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://patch.msgid.link/20241018051339.418890-2-dmantipov@yandex.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
Running rcutorture scenario TREE05, the below warning is triggered.

[   32.604594] WARNING: suspicious RCU usage
[   32.605928] 6.11.0-rc5-00040-g4ba4f1afb6a9 #55238 Not tainted
[   32.607812] -----------------------------
[   32.609140] kernel/events/core.c:13946 RCU-list traversed in non-reader section!!
[   32.611595] other info that might help us debug this:
[   32.614247] rcu_scheduler_active = 2, debug_locks = 1
[   32.616392] 3 locks held by cpuhp/4/35:
[   32.617687]  #0: ffffffffb666a650 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200
[   32.620563]  #1: ffffffffb666cd20 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200
[   32.623412]  #2: ffffffffb677c288 (pmus_lock){+.+.}-{3:3}, at: perf_event_exit_cpu_context+0x32/0x2f0

perf_event_clear_cpumask() uses list_for_each_entry_rcu() without an
obvious RCU read-side critical section.

Either pmus_srcu or pmus_lock is good enough to protect the pmus list.
In the current context, pmus_lock is already held. The
list_for_each_entry_rcu() is not required.

Fixes: 4ba4f1a ("perf: Generic hotplug support for a PMU with a scope")
Closes: https://lore.kernel.org/lkml/2b66dff8-b827-494b-b151-1ad8d56f13e6@paulmck-laptop/
Closes: https://lore.kernel.org/oe-lkp/202409131559.545634cc-oliver.sang@intel.com
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240913162340.2142976-1-kan.liang@linux.intel.com
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
Hou Tao says:

====================
Add the missing BPF_LINK_TYPE invocation for sockmap

From: Hou Tao <houtao1@huawei.com>

Hi,

The tiny patch set fixes the out-of-bound read problem when reading the
fdinfo of sock map link fd. And in order to spot such omission early for
the newly-added link type in the future, it also checks the validity of
the link->type and adds a WARN_ONCE() for missed invocation.

Please see individual patches for more details. And comments are always
welcome.

v3:
  * patch #2: check and warn the validity of link->type instead of
    adding a static assertion for bpf_link_type_strs array.

v2: http://lore.kernel.org/bpf/d49fa2f4-f743-c763-7579-c3cab4dd88cb@huaweicloud.com
====================

Link: https://lore.kernel.org/r/20241024013558.1135167-1-houtao@huaweicloud.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
generic/077 on x86_32 CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y with highmem,
on huge=always tmpfs, issues a warning and then hangs (interruptibly):

WARNING: CPU: 5 PID: 3517 at mm/highmem.c:622 kunmap_local_indexed+0x62/0xc9
CPU: 5 UID: 0 PID: 3517 Comm: cp Not tainted 6.12.0-rc4 #2
...
copy_page_from_iter_atomic+0xa6/0x5ec
generic_perform_write+0xf6/0x1b4
shmem_file_write_iter+0x54/0x67

Fix copy_page_from_iter_atomic() by limiting it in that case
(include/linux/skbuff.h skb_frag_must_loop() does similar).

But going forward, perhaps CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is too
surprising, has outlived its usefulness, and should just be removed?

Fixes: 908a1ad ("iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Link: https://lore.kernel.org/r/dd5f0c89-186e-18e1-4f43-19a60f5a9774@google.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
Hou Tao says:

====================
The patch set fixes several issues in bits iterator. Patch #1 fixes the
kmemleak problem of bits iterator. Patch #2~#3 fix the overflow problem
of nr_bits. Patch #4 fixes the potential stack corruption when bits
iterator is used on 32-bit host. Patch #5 adds more test cases for bits
iterator.

Please see the individual patches for more details. And comments are
always welcome.
---
v4:
 * patch #1: add ack from Yafang
 * patch #3: revert code-churn like changes:
   (1) compute nr_bytes and nr_bits before the check of nr_words.
   (2) use nr_bits == 64 to check for single u64, preventing build
       warning on 32-bit hosts.
 * patch #4: use "BITS_PER_LONG == 32" instead of "!defined(CONFIG_64BIT)"

v3: https://lore.kernel.org/bpf/20241025013233.804027-1-houtao@huaweicloud.com/T/#t
  * split the bits-iterator related patches from "Misc fixes for bpf"
    patch set
  * patch #1: use "!nr_bits || bits >= nr_bits" to stop the iteration
  * patch #2: add a new helper for the overflow problem
  * patch #3: decrease the limitation from 512 to 511 and check whether
    nr_bytes is too large for bpf memory allocator explicitly
  * patch #5: add two more test cases for bit iterator

v2: http://lore.kernel.org/bpf/d49fa2f4-f743-c763-7579-c3cab4dd88cb@huaweicloud.com
====================

Link: https://lore.kernel.org/r/20241030100516.3633640-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
Petr Machata says:

====================
mlxsw: Fixes

In this patchset:

- Tx header should be pushed for each packet which is transmitted via
  Spectrum ASICs. Patch #1 adds a missing call to skb_cow_head() to make
  sure that there is both enough room to push the Tx header and that the
  SKB header is not cloned and can be modified.

- Commit b5b60bb ("mlxsw: pci: Use page pool for Rx buffers
  allocation") converted mlxsw to use page pool for Rx buffers allocation.
  Sync for CPU and for device should be done for Rx pages. In patches #2
  and #3, add the missing calls to sync pages for, respectively, CPU and
  the device.

- Patch #4 then fixes a bug in IPv6 GRE forwarding offload. Patch #5 adds
  a generic forwarding test that fails with mlxsw ports prior to the fix.
====================

Link: https://patch.msgid.link/cover.1729866134.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Nov 6, 2024
When we compile and load lib/slub_kunit.c, it causes a panic.

The root cause is that __kmalloc_cache_noprof was called directly instead
of kmem_cache_alloc, so no alloc_tag was allocated. This left
current->alloc_tag null, leading to a null pointer dereference in
alloc_tag_ref_set.

Although my colleague Pei Xiao will later fix the code in slub_kunit.c,
we still need to fix the null pointer check logic for ref and tag to
avoid a panic caused by a null pointer dereference.
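
The defensive check described above can be sketched in user space. This is a minimal model, not the kernel code: struct tag, struct tag_ref and tag_ref_set() are hypothetical names chosen for illustration, and the real fix may differ in where and how it checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's allocation tag types. */
struct tag {
	unsigned long calls;
};

struct tag_ref {
	struct tag *tag;
};

/*
 * Bind ref to tag and account one call.
 * Returns false, touching nothing, if either pointer is NULL.
 */
bool tag_ref_set(struct tag_ref *ref, struct tag *tag)
{
	if (!ref || !tag)
		return false;

	ref->tag = tag;
	tag->calls++;
	return true;
}
```

A caller that passes a NULL tag (the situation the panic log shows) gets a false return instead of a dereference at a small offset from address zero.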

Here is the log for the panic:

[   74.779373][ T2158] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
[   74.780130][ T2158] Mem abort info:
[   74.780406][ T2158]   ESR = 0x0000000096000004
[   74.780756][ T2158]   EC = 0x25: DABT (current EL), IL = 32 bits
[   74.781225][ T2158]   SET = 0, FnV = 0
[   74.781529][ T2158]   EA = 0, S1PTW = 0
[   74.781836][ T2158]   FSC = 0x04: level 0 translation fault
[   74.782288][ T2158] Data abort info:
[   74.782577][ T2158]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[   74.783068][ T2158]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[   74.783533][ T2158]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   74.784010][ T2158] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000105f34000
[   74.784586][ T2158] [0000000000000020] pgd=0000000000000000, p4d=0000000000000000
[   74.785293][ T2158] Internal error: Oops: 0000000096000004 [#1] SMP
[   74.785805][ T2158] Modules linked in: slub_kunit kunit ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle 4
[   74.790661][ T2158] CPU: 0 UID: 0 PID: 2158 Comm: kunit_try_catch Kdump: loaded Tainted: G        W        N 6.12.0-rc3+ #2
[   74.791535][ T2158] Tainted: [W]=WARN, [N]=TEST
[   74.791889][ T2158] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[   74.792479][ T2158] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   74.793101][ T2158] pc : alloc_tagging_slab_alloc_hook+0x120/0x270
[   74.793607][ T2158] lr : alloc_tagging_slab_alloc_hook+0x120/0x270
[   74.794095][ T2158] sp : ffff800084d33cd0
[   74.794418][ T2158] x29: ffff800084d33cd0 x28: 0000000000000000 x27: 0000000000000000
[   74.795095][ T2158] x26: 0000000000000000 x25: 0000000000000012 x24: ffff80007b30e314
[   74.795822][ T2158] x23: ffff000390ff6f10 x22: 0000000000000000 x21: 0000000000000088
[   74.796555][ T2158] x20: ffff000390285840 x19: fffffd7fc3ef7830 x18: ffffffffffffffff
[   74.797283][ T2158] x17: ffff8000800e63b4 x16: ffff80007b33afc4 x15: ffff800081654c00
[   74.798011][ T2158] x14: 0000000000000000 x13: 205d383531325420 x12: 5b5d383734363537
[   74.798744][ T2158] x11: ffff800084d337e0 x10: 000000000000005d x9 : 00000000ffffffd0
[   74.799476][ T2158] x8 : 7f7f7f7f7f7f7f7f x7 : ffff80008219d188 x6 : c0000000ffff7fff
[   74.800206][ T2158] x5 : ffff0003fdbc9208 x4 : ffff800081edd188 x3 : 0000000000000001
[   74.800932][ T2158] x2 : 0beaa6dee1ac5a00 x1 : 0beaa6dee1ac5a00 x0 : ffff80037c2cb000
[   74.801656][ T2158] Call trace:
[   74.801954][ T2158]  alloc_tagging_slab_alloc_hook+0x120/0x270
[   74.802494][ T2158]  __kmalloc_cache_noprof+0x148/0x33c
[   74.802976][ T2158]  test_kmalloc_redzone_access+0x4c/0x104 [slub_kunit]
[   74.803607][ T2158]  kunit_try_run_case+0x70/0x17c [kunit]
[   74.804124][ T2158]  kunit_generic_run_threadfn_adapter+0x2c/0x4c [kunit]
[   74.804768][ T2158]  kthread+0x10c/0x118
[   74.805141][ T2158]  ret_from_fork+0x10/0x20
[   74.805540][ T2158] Code: b9400a80 11000400 b9000a80 97ffd858 (f94012d3)
[   74.806176][ T2158] SMP: stopping secondary CPUs
[   74.808130][ T2158] Starting crashdump kernel...

Link: https://lkml.kernel.org/r/20241020070819.307944-1-hao.ge@linux.dev
Fixes: e0a955b ("mm/codetag: add pgalloc_tag_copy()")
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jwrdegoede pushed a commit that referenced this pull request Nov 18, 2024
The scope of the TX skb is wider than just mse102x_tx_frame_spi(),
so in case the TX skb room needs to be expanded, we should free the
temporary skb instead of the original skb. Otherwise the original
TX skb pointer would be freed again in mse102x_tx_work(), which leads
to crashes:

  Internal error: Oops: 0000000096000004 [#2] PREEMPT SMP
  CPU: 0 PID: 712 Comm: kworker/0:1 Tainted: G      D            6.6.23
  Hardware name: chargebyte Charge SOM DC-ONE (DT)
  Workqueue: events mse102x_tx_work [mse102x]
  pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : skb_release_data+0xb8/0x1d8
  lr : skb_release_data+0x1ac/0x1d8
  sp : ffff8000819a3cc0
  x29: ffff8000819a3cc0 x28: ffff0000046daa60 x27: ffff0000057f2dc0
  x26: ffff000005386c00 x25: 0000000000000002 x24: 00000000ffffffff
  x23: 0000000000000000 x22: 0000000000000001 x21: ffff0000057f2e50
  x20: 0000000000000006 x19: 0000000000000000 x18: ffff00003fdacfcc
  x17: e69ad452d0c49def x16: 84a005feff870102 x15: 0000000000000000
  x14: 000000000000024a x13: 0000000000000002 x12: 0000000000000000
  x11: 0000000000000400 x10: 0000000000000930 x9 : ffff00003fd913e8
  x8 : fffffc00001bc008
  x7 : 0000000000000000 x6 : 0000000000000008
  x5 : ffff00003fd91340 x4 : 0000000000000000 x3 : 0000000000000009
  x2 : 00000000fffffffe x1 : 0000000000000000 x0 : 0000000000000000
  Call trace:
   skb_release_data+0xb8/0x1d8
   kfree_skb_reason+0x48/0xb0
   mse102x_tx_work+0x164/0x35c [mse102x]
   process_one_work+0x138/0x260
   worker_thread+0x32c/0x438
   kthread+0x118/0x11c
   ret_from_fork+0x10/0x20
  Code: aa1303e0 97fffab6 72001c1f 54000141 (f9400660)
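
The ownership rule behind this fix can be modelled in user space. The sketch below is an assumption-laden illustration, not the driver code: struct pkt, pkt_alloc(), pkt_free(), tx_frame() and tx_work() are hypothetical names, with tx_frame() playing the role of the frame-sending helper and tx_work() the role of the worker that owns the original buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical packet buffer, standing in for an skb. */
struct pkt {
	size_t headroom;
	size_t len;
};

int live_pkts; /* allocations not yet freed */

struct pkt *pkt_alloc(size_t headroom, size_t len)
{
	struct pkt *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->headroom = headroom;
	p->len = len;
	live_pkts++;
	return p;
}

void pkt_free(struct pkt *p)
{
	if (!p)
		return;
	live_pkts--;
	free(p);
}

/* Transmit one frame; never frees orig, which belongs to the caller. */
int tx_frame(struct pkt *orig, size_t need_headroom)
{
	struct pkt *tmp = NULL;
	struct pkt *tx = orig;
	int sent;

	if (orig->headroom < need_headroom) {
		tmp = pkt_alloc(need_headroom, orig->len);
		if (!tmp)
			return -1;
		tx = tmp;
	}

	sent = (int)tx->len; /* pretend the frame went out */
	pkt_free(tmp);       /* only the temporary copy, NULL-safe */
	return sent;
}

/* Worker: owns the original buffer and frees it exactly once. */
int tx_work(size_t headroom, size_t len, size_t need_headroom)
{
	struct pkt *p = pkt_alloc(headroom, len);
	int ret;

	if (!p)
		return -1;
	ret = tx_frame(p, need_headroom);
	pkt_free(p);
	return ret;
}
```

Whether or not the expansion path is taken, live_pkts returns to zero after tx_work(), which is the invariant the buggy version broke by freeing the original buffer in both places.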

Cc: stable@vger.kernel.org
Fixes: 2f207cb ("net: vertexcom: Add MSE102x SPI support")
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Link: https://patch.msgid.link/20241105163101.33216-1-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
Since the netlink attribute range validation provides inclusive
checking, the *max* of attribute NL80211_ATTR_MLO_LINK_ID should be
IEEE80211_MLD_MAX_NUM_LINKS - 1; otherwise it causes an off-by-one.
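
A small user-space sketch of the inclusive-range pitfall follows. It does not use the netlink policy API: struct range_policy and range_ok() are hypothetical, and MAX_NUM_LINKS is an illustrative constant standing in for IEEE80211_MLD_MAX_NUM_LINKS.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value standing in for IEEE80211_MLD_MAX_NUM_LINKS. */
#define MAX_NUM_LINKS 15

/* Hypothetical policy: both bounds are inclusive. */
struct range_policy {
	unsigned int min;
	unsigned int max;
};

bool range_ok(const struct range_policy *p, unsigned int v)
{
	return v >= p->min && v <= p->max;
}

/* Off by one: admits MAX_NUM_LINKS, one past the last array slot. */
const struct range_policy link_id_loose = { 0, MAX_NUM_LINKS };

/* Matches the valid indices 0 .. MAX_NUM_LINKS - 1. */
const struct range_policy link_id_tight = { 0, MAX_NUM_LINKS - 1 };
```

With the loose policy, a link ID equal to MAX_NUM_LINKS passes validation and would index one element past an array of MAX_NUM_LINKS entries; the tight policy rejects it.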

One crash stack for demonstration:
==================================================================
BUG: KASAN: wild-memory-access in ieee80211_tx_control_port+0x3b6/0xca0 net/mac80211/tx.c:5939
Read of size 6 at addr 001102080000000c by task fuzzer.386/9508

CPU: 1 PID: 9508 Comm: syz.1.386 Not tainted 6.1.70 #2
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x177/0x231 lib/dump_stack.c:106
 print_report+0xe0/0x750 mm/kasan/report.c:398
 kasan_report+0x139/0x170 mm/kasan/report.c:495
 kasan_check_range+0x287/0x290 mm/kasan/generic.c:189
 memcpy+0x25/0x60 mm/kasan/shadow.c:65
 ieee80211_tx_control_port+0x3b6/0xca0 net/mac80211/tx.c:5939
 rdev_tx_control_port net/wireless/rdev-ops.h:761 [inline]
 nl80211_tx_control_port+0x7b3/0xc40 net/wireless/nl80211.c:15453
 genl_family_rcv_msg_doit+0x22e/0x320 net/netlink/genetlink.c:756
 genl_family_rcv_msg net/netlink/genetlink.c:833 [inline]
 genl_rcv_msg+0x539/0x740 net/netlink/genetlink.c:850
 netlink_rcv_skb+0x1de/0x420 net/netlink/af_netlink.c:2508
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:861
 netlink_unicast_kernel net/netlink/af_netlink.c:1326 [inline]
 netlink_unicast+0x74b/0x8c0 net/netlink/af_netlink.c:1352
 netlink_sendmsg+0x882/0xb90 net/netlink/af_netlink.c:1874
 sock_sendmsg_nosec net/socket.c:716 [inline]
 __sock_sendmsg net/socket.c:728 [inline]
 ____sys_sendmsg+0x5cc/0x8f0 net/socket.c:2499
 ___sys_sendmsg+0x21c/0x290 net/socket.c:2553
 __sys_sendmsg net/socket.c:2582 [inline]
 __do_sys_sendmsg net/socket.c:2591 [inline]
 __se_sys_sendmsg+0x19e/0x270 net/socket.c:2589
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x45/0x90 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Update the policy to ensure correct validation.

Fixes: 7b0a0e3 ("wifi: cfg80211: do some rework towards MLO link APIs")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Suggested-by: Cengiz Can <cengiz.can@canonical.com>
Link: https://patch.msgid.link/20241130170526.96698-1-linma@zju.edu.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
It's used from trace__run(), for the 'perf trace' live mode, i.e. its
strace-like, non-perf.data file processing mode, the most common one.

The trace__run() function will set trace->host using machine__new_host()
that is supposed to give a machine instance representing the running
machine, and since we'll use perf_env__arch_strerrno() to get the right
errno -> string table, we need to use machine->env, so initialize it in
machine__new_host().
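
The idea of wiring up the environment inside the constructor can be sketched in user space. The types and functions below (struct env, struct machine, machine_new_host(), env_arch()) are simplified, hypothetical stand-ins and do not reproduce perf's real definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for perf's environment and machine objects. */
struct env {
	const char *arch;
};

struct machine {
	struct env *env;
};

struct env host_env = { "host" };

/* Builds the object for the running machine, with env wired up. */
struct machine *machine_new_host(void)
{
	struct machine *m = calloc(1, sizeof(*m));

	if (!m)
		return NULL;
	m->env = &host_env; /* without this, m->env stays NULL */
	return m;
}

/* Dereferences env unconditionally, as a consumer of env would. */
const char *env_arch(const struct env *env)
{
	return env->arch;
}
```

Initializing the pointer in the constructor means every later consumer can rely on it, rather than each call site having to guard against NULL.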

Before the patch:

  (gdb) run trace --errno-summary -a sleep 1
  <SNIP>
   Summary of events:

   gvfs-afc-volume (3187), 2 events, 0.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     pselect6               1      0     0.000     0.000     0.000     0.000      0.00%

   GUsbEventThread (3519), 2 events, 0.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     poll                   1      0     0.000     0.000     0.000     0.000      0.00%
  <SNIP>
  Program received signal SIGSEGV, Segmentation fault.
  0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478
  478		if (env->arch_strerrno == NULL)
  (gdb) bt
  #0  0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478
  #1  0x00000000004b75d2 in thread__dump_stats (ttrace=0x14f58f0, trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4673
  #2  0x00000000004b78bf in trace__fprintf_thread (fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>, thread=0x10fa0b0, trace=0x7fffffffa5b0) at builtin-trace.c:4708
  #3  0x00000000004b7ad9 in trace__fprintf_thread_summary (trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4747
  #4  0x00000000004b656e in trace__run (trace=0x7fffffffa5b0, argc=2, argv=0x7fffffffde60) at builtin-trace.c:4456
  #5  0x00000000004ba43e in cmd_trace (argc=2, argv=0x7fffffffde60) at builtin-trace.c:5487
  #6  0x00000000004c0414 in run_builtin (p=0xec3068 <commands+648>, argc=5, argv=0x7fffffffde60) at perf.c:351
  #7  0x00000000004c06bb in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:404
  #8  0x00000000004c0814 in run_argv (argcp=0x7fffffffdc4c, argv=0x7fffffffdc40) at perf.c:448
  #9  0x00000000004c0b5d in main (argc=5, argv=0x7fffffffde60) at perf.c:560
  (gdb)

After:

  root@number:~# perf trace -a --errno-summary sleep 1
  <SNIP>
     pw-data-loop (2685), 1410 events, 16.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     epoll_wait           188      0   983.428     0.000     5.231    15.595      8.68%
     ioctl                 94      0     0.811     0.004     0.009     0.016      2.82%
     read                 188      0     0.322     0.001     0.002     0.006      5.15%
     write                141      0     0.280     0.001     0.002     0.018      8.39%
     timerfd_settime       94      0     0.138     0.001     0.001     0.007      6.47%

   gnome-control-c (179406), 1848 events, 20.9%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     poll                 222      0   959.577     0.000     4.322    21.414     11.40%
     recvmsg              150      0     0.539     0.001     0.004     0.013      5.12%
     write                300      0     0.442     0.001     0.001     0.007      3.29%
     read                 150      0     0.183     0.001     0.001     0.009      5.53%
     getpid               102      0     0.101     0.000     0.001     0.008      7.82%

  root@number:~#

Fixes: 54373b5 ("perf env: Introduce perf_env__arch_strerrno()")
Reported-by: Veronika Molnarova <vmolnaro@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Veronika Molnarova <vmolnaro@redhat.com>
Acked-by: Michael Petlan <mpetlan@redhat.com>
Tested-by: Michael Petlan <mpetlan@redhat.com>
Link: https://lore.kernel.org/r/Z0XffUgNSv_9OjOi@x1
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
When virtnet_close is followed by virtnet_open, some TX completions can
possibly remain unconsumed, until they are finally processed during the
first NAPI poll after the netdev_tx_reset_queue(), resulting in a crash
[1]. Commit b96ed2c ("virtio_net: move netdev_tx_reset_queue() call
before RX napi enable") was not sufficient to eliminate all BQL crash
cases for virtio-net.

This issue can be reproduced with the latest net-next master by running:
`while :; do ip l set DEV down; ip l set DEV up; done` under heavy network
TX load from inside the machine.

netdev_tx_reset_queue() can actually be dropped from virtnet_open path;
the device is not stopped in any case. For BQL core part, it's just like
traffic nearly ceases to exist for some period. For stall detector added
to BQL, even if virtnet_close could somehow lead to some TX completions
delayed for long, followed by virtnet_open, we can just take it as stall
as mentioned in commit 6025b91 ("net: dqs: add NIC stall detector
based on BQL"). Note also that users can still reset stall_max via sysfs.

So, drop netdev_tx_reset_queue() from virtnet_enable_queue_pair(). This
eliminates the BQL crashes. As a result, netdev_tx_reset_queue() is now
explicitly required in freeze/restore path. This patch adds it
immediately after free_unused_bufs(), following the rule of thumb:
netdev_tx_reset_queue() should follow any SKB freeing not followed by
netdev_tx_completed_queue(). This seems the most consistent and
streamlined approach, and now netdev_tx_reset_queue() runs whenever
free_unused_bufs() is done.
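
Why a reset at the wrong moment is fatal can be seen in a toy model of byte-queue accounting. This is a sketch under the assumption that the check behind the BUG is "a completion may not exceed what is still outstanding"; struct dql_model and the model_*() helpers are hypothetical and much simpler than the kernel's dynamic queue limits code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy byte-queue accounting; field names are illustrative. */
struct dql_model {
	unsigned int num_queued;
	unsigned int num_completed;
};

void model_queued(struct dql_model *q, unsigned int bytes)
{
	q->num_queued += bytes;
}

/* Returns false where a real implementation would hit its BUG check. */
bool model_completed(struct dql_model *q, unsigned int bytes)
{
	if (bytes > q->num_queued - q->num_completed)
		return false;
	q->num_completed += bytes;
	return true;
}

void model_reset(struct dql_model *q)
{
	q->num_queued = 0;
	q->num_completed = 0;
}
```

If model_reset() runs while 100 bytes are still in flight, the late model_completed(&q, 100) is rejected, which corresponds to the crash in [1]. If the in-flight buffers are instead freed without a completion, resetting afterwards is what keeps the counters consistent for new traffic.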

[1]:
------------[ cut here ]------------
kernel BUG at lib/dynamic_queue_limits.c:99!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 UID: 0 PID: 1598 Comm: ip Tainted: G    N 6.12.0net-next_main+ #2
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), \
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:dql_completed+0x26b/0x290
Code: b7 c2 49 89 e9 44 89 da 89 c6 4c 89 d7 e8 ed 17 47 00 58 65 ff 0d
4d 27 90 7e 0f 85 fd fe ff ff e8 ea 53 8d ff e9 f3 fe ff ff <0f> 0b 01
d2 44 89 d1 29 d1 ba 00 00 00 00 0f 48 ca e9 28 ff ff ff
RSP: 0018:ffffc900002b0d08 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff888102398c80 RCX: 0000000080190009
RDX: 0000000000000000 RSI: 000000000000006a RDI: 0000000000000000
RBP: ffff888102398c00 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000000ca R11: 0000000000015681 R12: 0000000000000001
R13: ffffc900002b0d68 R14: ffff88811115e000 R15: ffff8881107aca40
FS:  00007f41ded69500(0000) GS:ffff888667dc0000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556ccc2dc1a0 CR3: 0000000104fd8003 CR4: 0000000000772ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 ? die+0x32/0x80
 ? do_trap+0xd9/0x100
 ? dql_completed+0x26b/0x290
 ? dql_completed+0x26b/0x290
 ? do_error_trap+0x6d/0xb0
 ? dql_completed+0x26b/0x290
 ? exc_invalid_op+0x4c/0x60
 ? dql_completed+0x26b/0x290
 ? asm_exc_invalid_op+0x16/0x20
 ? dql_completed+0x26b/0x290
 __free_old_xmit+0xff/0x170 [virtio_net]
 free_old_xmit+0x54/0xc0 [virtio_net]
 virtnet_poll+0xf4/0xe30 [virtio_net]
 ? __update_load_avg_cfs_rq+0x264/0x2d0
 ? update_curr+0x35/0x260
 ? reweight_entity+0x1be/0x260
 __napi_poll.constprop.0+0x28/0x1c0
 net_rx_action+0x329/0x420
 ? enqueue_hrtimer+0x35/0x90
 ? trace_hardirqs_on+0x1d/0x80
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? sched_clock_cpu+0xd/0x1a0
 handle_softirqs+0x138/0x3e0
 do_softirq.part.0+0x89/0xc0
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0xa7/0xb0
 virtnet_open+0xc8/0x310 [virtio_net]
 __dev_open+0xfa/0x1b0
 __dev_change_flags+0x1de/0x250
 dev_change_flags+0x22/0x60
 do_setlink.isra.0+0x2df/0x10b0
 ? rtnetlink_rcv_msg+0x34f/0x3f0
 ? netlink_rcv_skb+0x54/0x100
 ? netlink_unicast+0x23e/0x390
 ? netlink_sendmsg+0x21e/0x490
 ? ____sys_sendmsg+0x31b/0x350
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? __nla_validate_parse+0x5f/0xee0
 ? __pfx___probestub_irq_enable+0x3/0x10
 ? __create_object+0x5e/0x90
 ? security_capable+0x3b/0x70
 rtnl_newlink+0x784/0xaf0
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? stack_depot_save_flags+0x24/0x6d0
 ? __pfx_rtnl_newlink+0x10/0x10
 rtnetlink_rcv_msg+0x34f/0x3f0
 ? do_syscall_64+0x6c/0x180
 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
 ? __pfx_rtnetlink_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x54/0x100
 netlink_unicast+0x23e/0x390
 netlink_sendmsg+0x21e/0x490
 ____sys_sendmsg+0x31b/0x350
 ? copy_msghdr_from_user+0x6d/0xa0
 ___sys_sendmsg+0x86/0xd0
 ? __pte_offset_map+0x17/0x160
 ? preempt_count_add+0x69/0xa0
 ? __call_rcu_common.constprop.0+0x147/0x610
 ? preempt_count_add+0x69/0xa0
 ? preempt_count_add+0x69/0xa0
 ? _raw_spin_trylock+0x13/0x60
 ? trace_hardirqs_on+0x1d/0x80
 __sys_sendmsg+0x66/0xc0
 do_syscall_64+0x6c/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f41defe5b34
Code: 15 e1 12 0f 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00
f3 0f 1e fa 80 3d 35 95 0f 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
RSP: 002b:00007ffe5336ecc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f41defe5b34
RDX: 0000000000000000 RSI: 00007ffe5336ed30 RDI: 0000000000000003
RBP: 00007ffe5336eda0 R08: 0000000000000010 R09: 0000000000000001
R10: 00007ffe5336f6f9 R11: 0000000000000202 R12: 0000000000000003
R13: 0000000067452259 R14: 0000556ccc28b040 R15: 0000000000000000
 </TASK>
[...]

Fixes: c8bd1f7 ("virtio_net: add support for Byte Queue Limits")
Cc: <stable@vger.kernel.org> # v6.11+
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
[ pabeni: trimmed possibly troublesome separator ]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
…nts'

Koichiro Den says:

====================
virtio_net: correct netdev_tx_reset_queue() invocation points

When virtnet_close is followed by virtnet_open, some TX completions can
possibly remain unconsumed, until they are finally processed during the
first NAPI poll after the netdev_tx_reset_queue(), resulting in a crash
[1]. Commit b96ed2c ("virtio_net: move netdev_tx_reset_queue() call
before RX napi enable") was not sufficient to eliminate all BQL crash
scenarios for virtio-net.

This issue can be reproduced with the latest net-next master by running:
`while :; do ip l set DEV down; ip l set DEV up; done` under heavy network
TX load from inside the machine.

This patch series resolves the issue and also addresses similar existing
problems:

(a). Drop netdev_tx_reset_queue() from open/close path. This eliminates the
     BQL crashes due to the problematic open/close path.

(b). As a result of (a), netdev_tx_reset_queue() is now explicitly required
     in freeze/restore path. Add netdev_tx_reset_queue() immediately after
     free_unused_bufs() invocation.

(c). Fix missing resetting in virtnet_tx_resize().
     virtnet_tx_resize() has lacked proper resetting since commit
     c8bd1f7 ("virtio_net: add support for Byte Queue Limits").

(d). Fix missing resetting in the XDP_SETUP_XSK_POOL path.
     Similar to (c), this path lacked proper resetting. Call
     netdev_tx_reset_queue() when virtqueue_reset() has actually recycled
     unused buffers.

This patch series consists of six commits:
  [1/6]: Resolves (a) and (b).                      # also -stable 6.11.y
  [2/6]: Minor fix to make [4/6] streamlined.
  [3/6]: Prerequisite for (c).                      # also -stable 6.11.y
  [4/6]: Resolves (c) (incl. Prerequisite for (d))  # also -stable 6.11.y
  [5/6]: Prerequisite for (d).
  [6/6]: Resolves (d).

Changes for v4:
  - move netdev_tx_reset_queue() out of free_unused_bufs()
  - submit to net, not net-next
Changes for v3:
  - replace 'flushed' argument with 'recycle_done'
Changes for v2:
  - add tx queue resetting for (b) to (d) above

v3: https://lore.kernel.org/all/20241204050724.307544-1-koichiro.den@canonical.com/
v2: https://lore.kernel.org/all/20241203073025.67065-1-koichiro.den@canonical.com/
v1: https://lore.kernel.org/all/20241130181744.3772632-1-koichiro.den@canonical.com/

[1]:
------------[ cut here ]------------
kernel BUG at lib/dynamic_queue_limits.c:99!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 UID: 0 PID: 1598 Comm: ip Tainted: G    N 6.12.0net-next_main+ #2
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), \
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:dql_completed+0x26b/0x290
Code: b7 c2 49 89 e9 44 89 da 89 c6 4c 89 d7 e8 ed 17 47 00 58 65 ff 0d
4d 27 90 7e 0f 85 fd fe ff ff e8 ea 53 8d ff e9 f3 fe ff ff <0f> 0b 01
d2 44 89 d1 29 d1 ba 00 00 00 00 0f 48 ca e9 28 ff ff ff
RSP: 0018:ffffc900002b0d08 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff888102398c80 RCX: 0000000080190009
RDX: 0000000000000000 RSI: 000000000000006a RDI: 0000000000000000
RBP: ffff888102398c00 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000000ca R11: 0000000000015681 R12: 0000000000000001
R13: ffffc900002b0d68 R14: ffff88811115e000 R15: ffff8881107aca40
FS:  00007f41ded69500(0000) GS:ffff888667dc0000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556ccc2dc1a0 CR3: 0000000104fd8003 CR4: 0000000000772ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 ? die+0x32/0x80
 ? do_trap+0xd9/0x100
 ? dql_completed+0x26b/0x290
 ? dql_completed+0x26b/0x290
 ? do_error_trap+0x6d/0xb0
 ? dql_completed+0x26b/0x290
 ? exc_invalid_op+0x4c/0x60
 ? dql_completed+0x26b/0x290
 ? asm_exc_invalid_op+0x16/0x20
 ? dql_completed+0x26b/0x290
 __free_old_xmit+0xff/0x170 [virtio_net]
 free_old_xmit+0x54/0xc0 [virtio_net]
 virtnet_poll+0xf4/0xe30 [virtio_net]
 ? __update_load_avg_cfs_rq+0x264/0x2d0
 ? update_curr+0x35/0x260
 ? reweight_entity+0x1be/0x260
 __napi_poll.constprop.0+0x28/0x1c0
 net_rx_action+0x329/0x420
 ? enqueue_hrtimer+0x35/0x90
 ? trace_hardirqs_on+0x1d/0x80
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? sched_clock_cpu+0xd/0x1a0
 handle_softirqs+0x138/0x3e0
 do_softirq.part.0+0x89/0xc0
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0xa7/0xb0
 virtnet_open+0xc8/0x310 [virtio_net]
 __dev_open+0xfa/0x1b0
 __dev_change_flags+0x1de/0x250
 dev_change_flags+0x22/0x60
 do_setlink.isra.0+0x2df/0x10b0
 ? rtnetlink_rcv_msg+0x34f/0x3f0
 ? netlink_rcv_skb+0x54/0x100
 ? netlink_unicast+0x23e/0x390
 ? netlink_sendmsg+0x21e/0x490
 ? ____sys_sendmsg+0x31b/0x350
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? __nla_validate_parse+0x5f/0xee0
 ? __pfx___probestub_irq_enable+0x3/0x10
 ? __create_object+0x5e/0x90
 ? security_capable+0x3b/0x70
 rtnl_newlink+0x784/0xaf0
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? stack_depot_save_flags+0x24/0x6d0
 ? __pfx_rtnl_newlink+0x10/0x10
 rtnetlink_rcv_msg+0x34f/0x3f0
 ? do_syscall_64+0x6c/0x180
 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
 ? __pfx_rtnetlink_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x54/0x100
 netlink_unicast+0x23e/0x390
 netlink_sendmsg+0x21e/0x490
 ____sys_sendmsg+0x31b/0x350
 ? copy_msghdr_from_user+0x6d/0xa0
 ___sys_sendmsg+0x86/0xd0
 ? __pte_offset_map+0x17/0x160
 ? preempt_count_add+0x69/0xa0
 ? __call_rcu_common.constprop.0+0x147/0x610
 ? preempt_count_add+0x69/0xa0
 ? preempt_count_add+0x69/0xa0
 ? _raw_spin_trylock+0x13/0x60
 ? trace_hardirqs_on+0x1d/0x80
 __sys_sendmsg+0x66/0xc0
 do_syscall_64+0x6c/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f41defe5b34
Code: 15 e1 12 0f 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00
f3 0f 1e fa 80 3d 35 95 0f 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
RSP: 002b:00007ffe5336ecc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f41defe5b34
RDX: 0000000000000000 RSI: 00007ffe5336ed30 RDI: 0000000000000003
RBP: 00007ffe5336eda0 R08: 0000000000000010 R09: 0000000000000001
R10: 00007ffe5336f6f9 R11: 0000000000000202 R12: 0000000000000003
R13: 0000000067452259 R14: 0000556ccc28b040 R15: 0000000000000000
 </TASK>
[...]
====================

Link: https://patch.msgid.link/20241206011047.923923-1-koichiro.den@canonical.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
The page tables of the vmemmap, which is used for RV64 with
SPARSEMEM_VMEMMAP, are populated using pmd (page middle directory)
hugetables. However, the pmd allocation is not using the generic
mechanism used by the VMA code (e.g. pmd_alloc()), or the RISC-V specific
create_pgd_mapping()/alloc_pmd_late(). Instead, the vmemmap page table
code allocates a page, and calls vmemmap_set_pmd(). As a result, the pmd
ctor is *not* called, nor would it make sense to do so.

Now, when tearing down a vmemmap page table pmd, the cleanup code
would unconditionally, and incorrectly, call the pmd dtor, which
results in a crash (best case).

This issue was found when running the HMM selftests:

  | tools/testing/selftests/mm# ./test_hmm.sh smoke
  | ... # when unloading the test_hmm.ko module
  | page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10915b
  | flags: 0x1000000000000000(node=0|zone=1)
  | raw: 1000000000000000 0000000000000000 dead000000000122 0000000000000000
  | raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
  | page dumped because: VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte)
  | ------------[ cut here ]------------
  | kernel BUG at include/linux/mm.h:3080!
  | Kernel BUG [#1]
  | Modules linked in: test_hmm(-) sch_fq_codel fuse drm drm_panel_orientation_quirks backlight dm_mod
  | CPU: 1 UID: 0 PID: 514 Comm: modprobe Tainted: G        W          6.12.0-00982-gf2a4f1682d07 #2
  | Tainted: [W]=WARN
  | Hardware name: riscv-virtio qemu/qemu, BIOS 2024.10 10/01/2024
  | epc : remove_pgd_mapping+0xbec/0x1070
  |  ra : remove_pgd_mapping+0xbec/0x1070
  | epc : ffffffff80010a68 ra : ffffffff80010a68 sp : ff20000000a73940
  |  gp : ffffffff827b2d88 tp : ff6000008785da40 t0 : ffffffff80fbce04
  |  t1 : 0720072007200720 t2 : 706d756420656761 s0 : ff20000000a73a50
  |  s1 : ff6000008915cff8 a0 : 0000000000000039 a1 : 0000000000000008
  |  a2 : ff600003fff0de20 a3 : 0000000000000000 a4 : 0000000000000000
  |  a5 : 0000000000000000 a6 : c0000000ffffefff a7 : ffffffff824469b8
  |  s2 : ff1c0000022456c0 s3 : ff1ffffffdbfffff s4 : ff6000008915c000
  |  s5 : ff6000008915c000 s6 : ff6000008915c000 s7 : ff1ffffffdc00000
  |  s8 : 0000000000000001 s9 : ff1ffffffdc00000 s10: ffffffff819a31f0
  |  s11: ffffffffffffffff t3 : ffffffff8000c950 t4 : ff60000080244f00
  |  t5 : ff60000080244000 t6 : ff20000000a73708
  | status: 0000000200000120 badaddr: ffffffff80010a68 cause: 0000000000000003
  | [<ffffffff80010a68>] remove_pgd_mapping+0xbec/0x1070
  | [<ffffffff80fd238e>] vmemmap_free+0x14/0x1e
  | [<ffffffff8032e698>] section_deactivate+0x220/0x452
  | [<ffffffff8032ef7e>] sparse_remove_section+0x4a/0x58
  | [<ffffffff802f8700>] __remove_pages+0x7e/0xba
  | [<ffffffff803760d8>] memunmap_pages+0x2bc/0x3fe
  | [<ffffffff02a3ca28>] dmirror_device_remove_chunks+0x2ea/0x518 [test_hmm]
  | [<ffffffff02a3e026>] hmm_dmirror_exit+0x3e/0x1018 [test_hmm]
  | [<ffffffff80102c14>] __riscv_sys_delete_module+0x15a/0x2a6
  | [<ffffffff80fd020c>] do_trap_ecall_u+0x1f2/0x266
  | [<ffffffff80fde0a2>] _new_vmalloc_restore_context_a0+0xc6/0xd2
  | Code: bf51 7597 0184 8593 76a5 854a 4097 0029 80e7 2c00 (9002) 7597
  | ---[ end trace 0000000000000000 ]---
  | Kernel panic - not syncing: Fatal exception in interrupt

Add a check to avoid calling the pmd dtor, if the calling context is
vmemmap_free().
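
The check described above can be illustrated with a small user-space
model. This is only a sketch, not the actual RISC-V patch: the toy_*
names, the struct layout and the is_vmemmap parameter are all
hypothetical and stand in for the kernel's page-table helpers.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page-table page; not the kernel's ptdesc. */
struct toy_pmd_page {
	bool ctor_called;	/* set only when the page went through the ctor */
};

/* Counts dtor invocations so a caller can observe what happened. */
static int toy_dtor_calls;

static void toy_pmd_dtor(struct toy_pmd_page *p)
{
	/* A real dtor relies on ctor-initialised state; here we only count. */
	(void)p;
	toy_dtor_calls++;
}

static struct toy_pmd_page *toy_alloc_pmd(bool run_ctor)
{
	struct toy_pmd_page *p = calloc(1, sizeof(*p));

	if (p)
		p->ctor_called = run_ctor;
	return p;
}

/* Free a pmd page, skipping the dtor on the vmemmap teardown path. */
static void toy_free_pmd(struct toy_pmd_page *p, bool is_vmemmap)
{
	if (!is_vmemmap)
		toy_pmd_dtor(p);
	free(p);
}
```

In this model a page allocated with run_ctor == false is expected to be
freed with is_vmemmap == true, so the dtor never sees a page that
skipped the ctor.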

Fixes: c75a74f ("riscv: mm: Add memory hotplugging support")
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20241120131203.1859787-1-bjorn@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
This reworks hci_cb_list so that it no longer uses the mutex
hci_cb_list_lock, to avoid bugs like the one below:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5070, name: kworker/u9:2
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
4 locks held by kworker/u9:2/5070:
 #0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3229 [inline]
 #0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_scheduled_works+0x8e0/0x1770 kernel/workqueue.c:3335
 #1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3230 [inline]
 #1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_scheduled_works+0x91b/0x1770 kernel/workqueue.c:3335
 #2: ffff8880665d0078 (&hdev->lock){+.+.}-{3:3}, at: hci_le_create_big_complete_evt+0xcf/0xae0 net/bluetooth/hci_event.c:6914
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:298 [inline]
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:750 [inline]
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: hci_le_create_big_complete_evt+0xdb/0xae0 net/bluetooth/hci_event.c:6915
CPU: 0 PID: 5070 Comm: kworker/u9:2 Not tainted 6.8.0-syzkaller-08073-g480e035fc4c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: hci0 hci_rx_work
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 __might_resched+0x5d4/0x780 kernel/sched/core.c:10187
 __mutex_lock_common kernel/locking/mutex.c:585 [inline]
 __mutex_lock+0xc1/0xd70 kernel/locking/mutex.c:752
 hci_connect_cfm include/net/bluetooth/hci_core.h:2004 [inline]
 hci_le_create_big_complete_evt+0x3d9/0xae0 net/bluetooth/hci_event.c:6939
 hci_event_func net/bluetooth/hci_event.c:7514 [inline]
 hci_event_packet+0xa53/0x1540 net/bluetooth/hci_event.c:7569
 hci_rx_work+0x3e8/0xca0 net/bluetooth/hci_core.c:4171
 process_one_work kernel/workqueue.c:3254 [inline]
 process_scheduled_works+0xa00/0x1770 kernel/workqueue.c:3335
 worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
 kthread+0x2f0/0x390 kernel/kthread.c:388
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
 </TASK>

Reported-by: syzbot+2fb0835e0c9cefc34614@syzkaller.appspotmail.com
Tested-by: syzbot+2fb0835e0c9cefc34614@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2fb0835e0c9cefc34614
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
This fixes the circular locking dependency warning below, by
releasing the socket lock before entering iso_listen_bis, to
avoid any potential deadlock with hdev lock.

[   75.307983] ======================================================
[   75.307984] WARNING: possible circular locking dependency detected
[   75.307985] 6.12.0-rc6+ #22 Not tainted
[   75.307987] ------------------------------------------------------
[   75.307987] kworker/u81:2/2623 is trying to acquire lock:
[   75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO)
               at: iso_connect_cfm+0x253/0x840 [bluetooth]
[   75.308021]
               but task is already holding lock:
[   75.308022] ffff8fdd61a10078 (&hdev->lock)
               at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[   75.308053]
               which lock already depends on the new lock.

[   75.308054]
               the existing dependency chain (in reverse order) is:
[   75.308055]
               -> #1 (&hdev->lock){+.+.}-{3:3}:
[   75.308057]        __mutex_lock+0xad/0xc50
[   75.308061]        mutex_lock_nested+0x1b/0x30
[   75.308063]        iso_sock_listen+0x143/0x5c0 [bluetooth]
[   75.308085]        __sys_listen_socket+0x49/0x60
[   75.308088]        __x64_sys_listen+0x4c/0x90
[   75.308090]        x64_sys_call+0x2517/0x25f0
[   75.308092]        do_syscall_64+0x87/0x150
[   75.308095]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   75.308098]
               -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[   75.308100]        __lock_acquire+0x155e/0x25f0
[   75.308103]        lock_acquire+0xc9/0x300
[   75.308105]        lock_sock_nested+0x32/0x90
[   75.308107]        iso_connect_cfm+0x253/0x840 [bluetooth]
[   75.308128]        hci_connect_cfm+0x6c/0x190 [bluetooth]
[   75.308155]        hci_le_per_adv_report_evt+0x27b/0x2f0 [bluetooth]
[   75.308180]        hci_le_meta_evt+0xe7/0x200 [bluetooth]
[   75.308206]        hci_event_packet+0x21f/0x5c0 [bluetooth]
[   75.308230]        hci_rx_work+0x3ae/0xb10 [bluetooth]
[   75.308254]        process_one_work+0x212/0x740
[   75.308256]        worker_thread+0x1bd/0x3a0
[   75.308258]        kthread+0xe4/0x120
[   75.308259]        ret_from_fork+0x44/0x70
[   75.308261]        ret_from_fork_asm+0x1a/0x30
[   75.308263]
               other info that might help us debug this:

[   75.308264]  Possible unsafe locking scenario:

[   75.308264]        CPU0                CPU1
[   75.308265]        ----                ----
[   75.308265]   lock(&hdev->lock);
[   75.308267]                            lock(sk_lock-
                                                AF_BLUETOOTH-BTPROTO_ISO);
[   75.308268]                            lock(&hdev->lock);
[   75.308269]   lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO);
[   75.308270]
                *** DEADLOCK ***

[   75.308271] 4 locks held by kworker/u81:2/2623:
[   75.308272]  #0: ffff8fdd66e52148 ((wq_completion)hci0#2){+.+.}-{0:0},
                at: process_one_work+0x443/0x740
[   75.308276]  #1: ffffafb488b7fe48 ((work_completion)(&hdev->rx_work)),
                at: process_one_work+0x1ce/0x740
[   75.308280]  #2: ffff8fdd61a10078 (&hdev->lock){+.+.}-{3:3}
                at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[   75.308304]  #3: ffffffffb6ba4900 (rcu_read_lock){....}-{1:2},
                at: hci_connect_cfm+0x29/0x190 [bluetooth]
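
The idea of dropping the socket lock before taking the hdev lock can be
sketched with a toy lock tracker. This is an illustration under stated
assumptions, not the Bluetooth code: the toy_* names are hypothetical,
the locks are plain flags rather than real mutexes, and the control flow
is simplified.

```c
#include <stdbool.h>

/* Hypothetical lock model: a flag plus global bookkeeping, no real blocking. */
struct toy_lock {
	bool held;
};

static int toy_held_count;	/* locks currently held */
static int toy_max_held;	/* highest number of locks held at once */

static struct toy_lock toy_sk_lock;
static struct toy_lock toy_hdev_lock;

static void toy_acquire(struct toy_lock *l)
{
	l->held = true;
	toy_held_count++;
	if (toy_held_count > toy_max_held)
		toy_max_held = toy_held_count;
}

static void toy_release(struct toy_lock *l)
{
	l->held = false;
	toy_held_count--;
}

/* Stand-in for a helper that needs the hdev lock. */
static void toy_listen_bis(void)
{
	toy_acquire(&toy_hdev_lock);
	/* ... work that needs the device lock ... */
	toy_release(&toy_hdev_lock);
}

/* Listen path: never hold the socket lock across the hdev-locking call. */
static int toy_sock_listen(bool is_broadcast)
{
	bool need_bis;

	toy_acquire(&toy_sk_lock);
	need_bis = is_broadcast;	/* decide under the socket lock */
	toy_release(&toy_sk_lock);	/* drop it before touching hdev */

	if (need_bis)
		toy_listen_bis();

	toy_acquire(&toy_sk_lock);
	/* ... finish the socket state update ... */
	toy_release(&toy_sk_lock);
	return 0;
}
```

Because the two locks are never held at the same time on this path, it
cannot contribute a socket-lock to hdev-lock edge to the dependency
chain reported above.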

Fixes: 02171da ("Bluetooth: ISO: Add hcon for listening bis sk")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
This fixes the circular locking dependency warning below, by reworking
iso_sock_recvmsg, to ensure that the socket lock is always released
before calling a function that locks hdev.

[  561.670344] ======================================================
[  561.670346] WARNING: possible circular locking dependency detected
[  561.670349] 6.12.0-rc6+ #26 Not tainted
[  561.670351] ------------------------------------------------------
[  561.670353] iso-tester/3289 is trying to acquire lock:
[  561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3},
               at: iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670405]
               but task is already holding lock:
[  561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0},
               at: iso_sock_recvmsg+0xbf/0x500 [bluetooth]
[  561.670450]
               which lock already depends on the new lock.

[  561.670452]
               the existing dependency chain (in reverse order) is:
[  561.670453]
               -> #2 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}:
[  561.670458]        lock_acquire+0x7c/0xc0
[  561.670463]        lock_sock_nested+0x3b/0xf0
[  561.670467]        bt_accept_dequeue+0x1a5/0x4d0 [bluetooth]
[  561.670510]        iso_sock_accept+0x271/0x830 [bluetooth]
[  561.670547]        do_accept+0x3dd/0x610
[  561.670550]        __sys_accept4+0xd8/0x170
[  561.670553]        __x64_sys_accept+0x74/0xc0
[  561.670556]        x64_sys_call+0x17d6/0x25f0
[  561.670559]        do_syscall_64+0x87/0x150
[  561.670563]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670567]
               -> #1 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[  561.670571]        lock_acquire+0x7c/0xc0
[  561.670574]        lock_sock_nested+0x3b/0xf0
[  561.670577]        iso_sock_listen+0x2de/0xf30 [bluetooth]
[  561.670617]        __sys_listen_socket+0xef/0x130
[  561.670620]        __x64_sys_listen+0xe1/0x190
[  561.670623]        x64_sys_call+0x2517/0x25f0
[  561.670626]        do_syscall_64+0x87/0x150
[  561.670629]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670632]
               -> #0 (&hdev->lock){+.+.}-{3:3}:
[  561.670636]        __lock_acquire+0x32ad/0x6ab0
[  561.670639]        lock_acquire.part.0+0x118/0x360
[  561.670642]        lock_acquire+0x7c/0xc0
[  561.670644]        __mutex_lock+0x18d/0x12f0
[  561.670647]        mutex_lock_nested+0x1b/0x30
[  561.670651]        iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670687]        iso_sock_recvmsg+0x3e9/0x500 [bluetooth]
[  561.670722]        sock_recvmsg+0x1d5/0x240
[  561.670725]        sock_read_iter+0x27d/0x470
[  561.670727]        vfs_read+0x9a0/0xd30
[  561.670731]        ksys_read+0x1a8/0x250
[  561.670733]        __x64_sys_read+0x72/0xc0
[  561.670736]        x64_sys_call+0x1b12/0x25f0
[  561.670738]        do_syscall_64+0x87/0x150
[  561.670741]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670744]
               other info that might help us debug this:

[  561.670745] Chain exists of:
&hdev->lock --> sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> sk_lock-AF_BLUETOOTH

[  561.670751]  Possible unsafe locking scenario:

[  561.670753]        CPU0                    CPU1
[  561.670754]        ----                    ----
[  561.670756]   lock(sk_lock-AF_BLUETOOTH);
[  561.670758]                                lock(sk_lock
                                              AF_BLUETOOTH-BTPROTO_ISO);
[  561.670761]                                lock(sk_lock-AF_BLUETOOTH);
[  561.670764]   lock(&hdev->lock);
[  561.670767]
                *** DEADLOCK ***

Fixes: 07a9342 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
Aishwarya reports that warnings are sometimes seen when running the
ftrace kselftests, e.g.

| WARNING: CPU: 5 PID: 2066 at arch/arm64/kernel/stacktrace.c:141 arch_stack_walk+0x4a0/0x4c0
| Modules linked in:
| CPU: 5 UID: 0 PID: 2066 Comm: ftracetest Not tainted 6.13.0-rc2 #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x4a0/0x4c0
| lr : arch_stack_walk+0x248/0x4c0
| sp : ffff800083643d20
| x29: ffff800083643dd0 x28: ffff00007b891400 x27: ffff00007b891928
| x26: 0000000000000001 x25: 00000000000000c0 x24: ffff800082f39d80
| x23: ffff80008003ee8c x22: ffff80008004baa8 x21: ffff8000800533e0
| x20: ffff800083643e10 x19: ffff80008003eec8 x18: 0000000000000000
| x17: 0000000000000000 x16: ffff800083640000 x15: 0000000000000000
| x14: 02a37a802bbb8a92 x13: 00000000000001a9 x12: 0000000000000001
| x11: ffff800082ffad60 x10: ffff800083643d20 x9 : ffff80008003eed0
| x8 : ffff80008004baa8 x7 : ffff800086f2be80 x6 : ffff0000057cf000
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800086f2b690
| x2 : ffff80008004baa8 x1 : ffff80008004baa8 x0 : ffff80008004baa8
| Call trace:
|  arch_stack_walk+0x4a0/0x4c0 (P)
|  arch_stack_walk+0x248/0x4c0 (L)
|  profile_pc+0x44/0x80
|  profile_tick+0x50/0x80 (F)
|  tick_nohz_handler+0xcc/0x160 (F)
|  __hrtimer_run_queues+0x2ac/0x340 (F)
|  hrtimer_interrupt+0xf4/0x268 (F)
|  arch_timer_handler_virt+0x34/0x60 (F)
|  handle_percpu_devid_irq+0x88/0x220 (F)
|  generic_handle_domain_irq+0x34/0x60 (F)
|  gic_handle_irq+0x54/0x140 (F)
|  call_on_irq_stack+0x24/0x58 (F)
|  do_interrupt_handler+0x88/0x98
|  el1_interrupt+0x34/0x68 (F)
|  el1h_64_irq_handler+0x18/0x28
|  el1h_64_irq+0x6c/0x70
|  queued_spin_lock_slowpath+0x78/0x460 (P)

The warning in question is:

  WARN_ON_ONCE(state->common.pc == orig_pc)

... in kunwind_recover_return_address(), which is triggered when
return_to_handler() is encountered in the trace, but
ftrace_graph_ret_addr() cannot find a corresponding original return
address on the fgraph return stack.

This happens because the stacktrace code encounters an exception
boundary where the LR was not live at the time of the exception, but the
LR happens to contain return_to_handler(); either because the task
recently returned there, or due to unfortunate usage of the LR as a
scratch register. In such cases attempts to recover the return address
via ftrace_graph_ret_addr() may fail, triggering the WARN_ON_ONCE()
above and aborting the unwind (hence the stacktrace terminating after
reporting the PC at the time of the exception).

Handling unreliable LR values in these cases is likely to require some
larger rework, so for the moment avoid this problem by restoring the old
behaviour of skipping the LR at exception boundaries, which the
stacktrace code did prior to commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

This commit is effectively a partial revert, keeping the structures and
logic to explicitly identify exception boundaries while still skipping
reporting of the LR. The logic to explicitly identify exception
boundaries is still useful for general robustness and as a building
block for future support for RELIABLE_STACKTRACE.

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Aishwarya TCV <aishwarya.tcv@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241211140704.2498712-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
The arm64 stacktrace code has a few error conditions where a
WARN_ON_ONCE() is triggered before the stacktrace is terminated and an
error is returned to the caller. The conditions shouldn't be triggered
when unwinding the current task, but it is possible to trigger these
when unwinding another task which is not blocked, as the stack of that
task is concurrently modified. Kent reports that these warnings can be
triggered while running filesystem tests on bcachefs, which calls the
stacktrace code directly.

To produce a meaningful stacktrace of another task, the task in question
should be blocked, but the stacktrace code is expected to be robust to
cases where it is not blocked. Note that this is purely about not
unduly scaring the user and/or crashing the kernel; stacktraces in such
cases are meaningless and may leak kernel secrets from the stack of the
task being unwound.

Ideally we'd pin the task in a blocked state during the unwind, as we do
for /proc/${PID}/wchan since commit:

  42a20f8 ("sched: Add wrapper for get_wchan() to keep task blocked")

... but a bunch of places don't do that, notably /proc/${PID}/stack,
where we don't pin the task in a blocked state, but do restrict the
output to privileged users since commit:

  f8a00ce ("proc: restrict kernel stack dumps to root")

... and so it's possible to trigger these warnings accidentally, e.g. by
reading /proc/*/stack (as root):

| for n in $(seq 1 10); do
|     while true; do cat /proc/*/stack > /dev/null 2>&1; done &
| done
| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370
| Modules linked in:
| CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x1c8/0x370
| lr : arch_stack_walk+0x1b0/0x370
| sp : ffff800080773890
| x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038
| x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000
| x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0
| x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
| x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818
| x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0
| x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000
| Call trace:
|  arch_stack_walk+0x1c8/0x370 (P)
|  stack_trace_save_tsk+0x8c/0x108
|  proc_pid_stack+0xb0/0x134
|  proc_single_show+0x60/0x120
|  seq_read_iter+0x104/0x438
|  seq_read+0xf8/0x140
|  vfs_read+0xc4/0x31c
|  ksys_read+0x70/0x108
|  __arm64_sys_read+0x1c/0x28
|  invoke_syscall+0x48/0x104
|  el0_svc_common.constprop.0+0x40/0xe0
|  do_el0_svc+0x1c/0x28
|  el0_svc+0x30/0xcc
|  el0t_64_sync_handler+0x10c/0x138
|  el0t_64_sync+0x198/0x19c
| ---[ end trace 0000000000000000 ]---

Fix this by only warning when unwinding the current task. When unwinding
another task the error conditions will be handled by returning an error
without producing a warning.
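
As a rough illustration of that rule, the following user-space sketch
warns (once) only when the task being unwound is the current one, and
otherwise just returns an error. The toy_* names and the boolean
parameters are hypothetical; the real code works on the unwinder state
and uses WARN_ON_ONCE().

```c
#include <stdbool.h>
#include <stdio.h>

static bool toy_warned;		/* models the "once" part of WARN_ON_ONCE() */
static int toy_warnings;	/* number of warnings actually emitted */

/*
 * Returns 0 when the frame looks valid and -1 on an unwind error.
 * A warning is printed only when unwinding the current task.
 */
static int toy_unwind_step(bool unwinding_current, bool frame_is_bogus)
{
	if (!frame_is_bogus)
		return 0;

	if (unwinding_current && !toy_warned) {
		toy_warned = true;
		toy_warnings++;
		fprintf(stderr, "toy unwinder: unexpected frame\n");
	}

	/* Either way the unwind is aborted and the error is reported. */
	return -1;
}
```

A caller unwinding another, possibly running, task still sees the -1
return value, but the log stays quiet.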

The two warnings in kunwind_next_frame_record_meta() were added recently
as part of commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

The warning when recovering the fgraph return address has changed form
many times, but was originally introduced back in commit:

  9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Fixes: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241211140704.2498712-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
The current implementation removes cache tags after disabling ATS,
leading to potential memory leaks and kernel crashes. Specifically,
CACHE_TAG_DEVTLB type cache tags may still remain in the list even
after the domain is freed, causing a use-after-free condition.

This issue really shows up when multiple VFs from different PFs are
passed through to a single user-space process via vfio-pci. In such
cases, the kernel may crash with kernel messages like:

 BUG: kernel NULL pointer dereference, address: 0000000000000014
 PGD 19036a067 P4D 1940a3067 PUD 136c9b067 PMD 0
 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 74 UID: 0 PID: 3183 Comm: testCli Not tainted 6.11.9 #2
 RIP: 0010:cache_tag_flush_range+0x9b/0x250
 Call Trace:
  <TASK>
  ? __die+0x1f/0x60
  ? page_fault_oops+0x163/0x590
  ? exc_page_fault+0x72/0x190
  ? asm_exc_page_fault+0x22/0x30
  ? cache_tag_flush_range+0x9b/0x250
  ? cache_tag_flush_range+0x5d/0x250
  intel_iommu_tlb_sync+0x29/0x40
  intel_iommu_unmap_pages+0xfe/0x160
  __iommu_unmap+0xd8/0x1a0
  vfio_unmap_unpin+0x182/0x340 [vfio_iommu_type1]
  vfio_remove_dma+0x2a/0xb0 [vfio_iommu_type1]
  vfio_iommu_type1_ioctl+0xafa/0x18e0 [vfio_iommu_type1]

Move cache_tag_unassign_domain() before iommu_disable_pci_caps() to fix
it.
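
Why the order matters can be shown with a toy model. It assumes, purely
for illustration, that the unassign step only drops the device-TLB tag
while ATS is still marked enabled; the toy_* names and fields are
hypothetical and are not the VT-d driver's data structures.

```c
#include <stdbool.h>

/* Hypothetical per-device state; not the driver's real structures. */
struct toy_dev {
	bool ats_enabled;
	bool has_iotlb_tag;
	bool has_devtlb_tag;
};

static void toy_assign_tags(struct toy_dev *d)
{
	d->has_iotlb_tag = true;
	if (d->ats_enabled)
		d->has_devtlb_tag = true;
}

/* Assumption: the device-TLB tag is only looked for while ATS is enabled. */
static void toy_unassign_tags(struct toy_dev *d)
{
	d->has_iotlb_tag = false;
	if (d->ats_enabled)
		d->has_devtlb_tag = false;
}

static void toy_disable_caps(struct toy_dev *d)
{
	d->ats_enabled = false;
}

/* Returns true when a stale device-TLB tag is left behind after detach. */
static bool toy_detach(struct toy_dev *d, bool unassign_first)
{
	if (unassign_first) {
		toy_unassign_tags(d);
		toy_disable_caps(d);
	} else {
		toy_disable_caps(d);
		toy_unassign_tags(d);
	}
	return d->has_devtlb_tag;
}
```

With unassign_first set, no tag survives the detach; with the old order
the model leaves a device-TLB tag behind, which corresponds to the stale
list entry described above.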

Fixes: 3b1d9e2 ("iommu/vt-d: Add cache tag assignment interface")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241129020506.576413-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
jwrdegoede pushed a commit that referenced this pull request Dec 15, 2024
…s_lock

For storing a value to a queue attribute, the queue_attr_store function
first freezes the queue (->q_usage_counter(io)) and then acquires
->sysfs_lock. This seems incorrect, as the usual ordering is to acquire
->sysfs_lock before freezing the queue. This incorrect ordering causes
the following lockdep splat, which we can always reproduce simply by
accessing /sys/kernel/debug using the ls command:

[   57.597146] WARNING: possible circular locking dependency detected
[   57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G        W
[   57.597162] ------------------------------------------------------
[   57.597168] ls/4605 is trying to acquire lock:
[   57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0
[   57.597200]
               but task is already holding lock:
[   57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4
[   57.597226]
               which lock already depends on the new lock.

[   57.597233]
               the existing dependency chain (in reverse order) is:
[   57.597241]
               -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
[   57.597255]        down_write+0x6c/0x18c
[   57.597264]        start_creating+0xb4/0x24c
[   57.597274]        debugfs_create_dir+0x2c/0x1e8
[   57.597283]        blk_register_queue+0xec/0x294
[   57.597292]        add_disk_fwnode+0x2e4/0x548
[   57.597302]        brd_alloc+0x2c8/0x338
[   57.597309]        brd_init+0x100/0x178
[   57.597317]        do_one_initcall+0x88/0x3e4
[   57.597326]        kernel_init_freeable+0x3cc/0x6e0
[   57.597334]        kernel_init+0x34/0x1cc
[   57.597342]        ret_from_kernel_user_thread+0x14/0x1c
[   57.597350]
               -> #4 (&q->debugfs_mutex){+.+.}-{4:4}:
[   57.597362]        __mutex_lock+0xfc/0x12a0
[   57.597370]        blk_register_queue+0xd4/0x294
[   57.597379]        add_disk_fwnode+0x2e4/0x548
[   57.597388]        brd_alloc+0x2c8/0x338
[   57.597395]        brd_init+0x100/0x178
[   57.597402]        do_one_initcall+0x88/0x3e4
[   57.597410]        kernel_init_freeable+0x3cc/0x6e0
[   57.597418]        kernel_init+0x34/0x1cc
[   57.597426]        ret_from_kernel_user_thread+0x14/0x1c
[   57.597434]
               -> #3 (&q->sysfs_lock){+.+.}-{4:4}:
[   57.597446]        __mutex_lock+0xfc/0x12a0
[   57.597454]        queue_attr_store+0x9c/0x110
[   57.597462]        sysfs_kf_write+0x70/0xb0
[   57.597471]        kernfs_fop_write_iter+0x1b0/0x2ac
[   57.597480]        vfs_write+0x3dc/0x6e8
[   57.597488]        ksys_write+0x84/0x140
[   57.597495]        system_call_exception+0x130/0x360
[   57.597504]        system_call_common+0x160/0x2c4
[   57.597516]
               -> #2 (&q->q_usage_counter(io)#21){++++}-{0:0}:
[   57.597530]        __submit_bio+0x5ec/0x828
[   57.597538]        submit_bio_noacct_nocheck+0x1e4/0x4f0
[   57.597547]        iomap_readahead+0x2a0/0x448
[   57.597556]        xfs_vm_readahead+0x28/0x3c
[   57.597564]        read_pages+0x88/0x41c
[   57.597571]        page_cache_ra_unbounded+0x1ac/0x2d8
[   57.597580]        filemap_get_pages+0x188/0x984
[   57.597588]        filemap_read+0x13c/0x4bc
[   57.597596]        xfs_file_buffered_read+0x88/0x17c
[   57.597605]        xfs_file_read_iter+0xac/0x158
[   57.597614]        vfs_read+0x2d4/0x3b4
[   57.597622]        ksys_read+0x84/0x144
[   57.597629]        system_call_exception+0x130/0x360
[   57.597637]        system_call_common+0x160/0x2c4
[   57.597647]
               -> #1 (mapping.invalidate_lock#2){++++}-{4:4}:
[   57.597661]        down_read+0x6c/0x220
[   57.597669]        filemap_fault+0x870/0x100c
[   57.597677]        xfs_filemap_fault+0xc4/0x18c
[   57.597684]        __do_fault+0x64/0x164
[   57.597693]        __handle_mm_fault+0x1274/0x1dac
[   57.597702]        handle_mm_fault+0x248/0x484
[   57.597711]        ___do_page_fault+0x428/0xc0c
[   57.597719]        hash__do_page_fault+0x30/0x68
[   57.597727]        do_hash_fault+0x90/0x35c
[   57.597736]        data_access_common_virt+0x210/0x220
[   57.597745]        _copy_from_user+0xf8/0x19c
[   57.597754]        sel_write_load+0x178/0xd54
[   57.597762]        vfs_write+0x108/0x6e8
[   57.597769]        ksys_write+0x84/0x140
[   57.597777]        system_call_exception+0x130/0x360
[   57.597785]        system_call_common+0x160/0x2c4
[   57.597794]
               -> #0 (&mm->mmap_lock){++++}-{4:4}:
[   57.597806]        __lock_acquire+0x17cc/0x2330
[   57.597814]        lock_acquire+0x138/0x400
[   57.597822]        __might_fault+0x7c/0xc0
[   57.597830]        filldir64+0xe8/0x390
[   57.597839]        dcache_readdir+0x80/0x2d4
[   57.597846]        iterate_dir+0xd8/0x1d4
[   57.597855]        sys_getdents64+0x88/0x2d4
[   57.597864]        system_call_exception+0x130/0x360
[   57.597872]        system_call_common+0x160/0x2c4
[   57.597881]
               other info that might help us debug this:

[   57.597888] Chain exists of:
                 &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3

[   57.597905]  Possible unsafe locking scenario:

[   57.597911]        CPU0                    CPU1
[   57.597917]        ----                    ----
[   57.597922]   rlock(&sb->s_type->i_mutex_key#3);
[   57.597932]                                lock(&q->debugfs_mutex);
[   57.597940]                                lock(&sb->s_type->i_mutex_key#3);
[   57.597950]   rlock(&mm->mmap_lock);
[   57.597958]
                *** DEADLOCK ***

[   57.597965] 2 locks held by ls/4605:
[   57.597971]  #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154
[   57.597989]  #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4

Prevent the above lockdep warning by acquiring ->sysfs_lock before
freezing the queue while storing a queue attribute in the
queue_attr_store function. Later, we also found[1] another function,
__blk_mq_update_nr_hw_queues, where we first freeze the queue and then
acquire ->sysfs_lock. So we've also updated the lock ordering in the
__blk_mq_update_nr_hw_queues function and ensured that all code paths
follow the correct lock ordering, i.e. acquire ->sysfs_lock before
freezing the queue.
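
The intended ordering can be sketched as follows. This is a simplified
user-space model, not the block layer code: the toy_* helpers only
record the order in which the steps run, and every name here is
hypothetical.

```c
/* Steps recorded by the toy helpers below. */
enum toy_step {
	TOY_SYSFS_LOCK = 1,
	TOY_FREEZE,
	TOY_UNFREEZE,
	TOY_SYSFS_UNLOCK,
};

#define TOY_MAX_STEPS 8

static enum toy_step toy_order[TOY_MAX_STEPS];
static int toy_nr_steps;

static void toy_record(enum toy_step step)
{
	if (toy_nr_steps < TOY_MAX_STEPS)
		toy_order[toy_nr_steps++] = step;
}

/* Store an attribute: take the sysfs lock first, then freeze the queue. */
static int toy_attr_store(int *attr, int value)
{
	toy_record(TOY_SYSFS_LOCK);	/* stands in for taking ->sysfs_lock */
	toy_record(TOY_FREEZE);		/* stands in for freezing the queue */

	*attr = value;

	toy_record(TOY_UNFREEZE);	/* undo in reverse order */
	toy_record(TOY_SYSFS_UNLOCK);
	return 0;
}
```

Keeping the same order on every path that needs both (attribute stores
and hardware-queue updates alike) is what removes the inverted
dependency from the chain.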

[1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/

Reported-by: kjain@linux.ibm.com
Fixes: af28141 ("block: freeze the queue in queue_attr_store")
Tested-by: kjain@linux.ibm.com
Cc: hch@lst.de
Cc: axboe@kernel.dk
Cc: ritesh.list@gmail.com
Cc: ming.lei@redhat.com
Cc: gjoyce@linux.ibm.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241210144222.1066229-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
jwrdegoede pushed a commit that referenced this pull request Dec 21, 2024
…ux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.13, part #2

 - Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
   SPE/TRBE initialization

 - Align nested page table walker with the intended memory attribute
   combining rules of the architecture

 - Prevent userspace from constraining the advertised ASID width,
   avoiding horrors of guest TLBIs not matching the intended context in
   hardware

 - Don't leak references on LPIs when insertion into the translation
   cache fails
jwrdegoede pushed a commit that referenced this pull request Dec 21, 2024
Guangguan Wang says:

====================
net: several fixes for smc

v1 -> v2:
rewrite patch #2 suggested by Paolo.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>