Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

同步torvalds的内核修改。 #1

Merged
merged 10,000 commits into from
Oct 25, 2014
Merged

同步torvalds的内核修改。 #1

merged 10,000 commits into from
Oct 25, 2014

Conversation

wutiejun
Copy link
Owner

No description provided.

davejiang and others added 30 commits October 17, 2014 07:08
Move the platform detection function to separate functions to allow
easier maintenence.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
To simplify some of the platform detection code. Move the platform detection
to a function to be called earlier.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Instead of using a module parameter, we should detect the errata via
PCI DID and then set an appropriate flag. This will be used for additional
errata later on.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
On the Haswell platform, a split BAR option to allow creation of 2
32bit BARs (4 and 5) from the 64bit BAR 4. Adding support for this
new option.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If kprobes is disabled uprobes will not compile.
Fix this by including the correct header files.

Signed-off-by: Jan Willeke <willeke@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
So that when an evsel is embedded into other struct it can free up
resources calling perf_evsel__exit().

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-n1w68pfe9m2vkhm4sqs8y1en@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
It was being found, by chance, because evsel.h needlessly includes
util/cgroup.h, which will be sorted out in a following patch.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-xsvxr747wkkpg1ay9dramorr@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The only thing we need is a forward declaration for 'struct cgroup_sel',
that is inside 'struct perf_evsel'.

Include cgroup.h instead on the tools that support cgroups.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-b7kuymbgf0zxi5viyjjtu5hk@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
zatimend has reported that in his environment (3.16/gcc4.8.3/corei7)
memset() calls which clear out sensitive data in extract_{buf,entropy,
entropy_user}() in random driver are being optimized away by gcc.

Add a helper memzero_explicit() (similarly as explicit_bzero() variants)
that can be used in such cases where a variable with sensitive data is
being cleared out in the end. Other use cases might also be in crypto
code. [ I have put this into lib/string.c though, as it's always built-in
and doesn't need any dependencies then. ]

Fixes kernel bugzilla: 82041

Reported-by: zatimend@hotmail.co.uk
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Recently, in commit 13aa93c ("random: add and use memzero_explicit()
for clearing data"), we have found that GCC may optimize some memset()
cases away when it detects a stack variable is not being used anymore
and going out of scope. This can happen, for example, in cases when we
are clearing out sensitive information such as keying material or any
e.g. intermediate results from crypto computations, etc.

With the help of Coccinelle, we can figure out and fix such occurences
in the crypto subsytem as well. Julia Lawall provided the following
Coccinelle program:

  @@
  type T;
  identifier x;
  @@

  T x;
  ... when exists
      when any
  -memset
  +memzero_explicit
     (&x,
  -0,
     ...)
  ... when != x
      when strict

  @@
  type T;
  identifier x;
  @@

  T x[...];
  ... when exists
      when any
  -memset
  +memzero_explicit
     (x,
  -0,
     ...)
  ... when != x
      when strict

Therefore, make use of the drop-in replacement memzero_explicit() for
exactly such cases instead of using memset().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This simplifies the lanai.c driver by using
the module_pci_driver() macro, at the expense
of losing only debugging messages.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 971f10e ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
missed that cookie_v4_check() still calls ip_options_echo() which uses
IPCB(). It should use TCPCB() at TCP layer, so call __ip_options_echo()
instead.

Fixes: commit 971f10e ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Reported-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Tested-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cookie_v4_check() allocates ip_options_rcu in the same way
with tcp_v4_save_options(), we can just make it a helper function.

Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can retrieve opt from skb, no need to pass it as a parameter.
And opt should always be non-NULL, no need to check.

Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Tested-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding period data column to be displayed in perf script.  It's possible
to get period values using -f option, like:

  $ perf script -f comm,tid,time,period,ip,sym,dso
          :26019 26019 52414.329088:       3707  ffffffff8105443a native_write_msr_safe ([kernel.kallsyms])
          :26019 26019 52414.329088:         44  ffffffff8105443a native_write_msr_safe ([kernel.kallsyms])
          :26019 26019 52414.329093:       1987  ffffffff8105443a native_write_msr_safe ([kernel.kallsyms])
          :26019 26019 52414.329093:          6  ffffffff8105443a native_write_msr_safe ([kernel.kallsyms])
              ls 26019 52414.329442:     537558        3407c0639c _dl_map_object_from_fd (/usr/lib64/ld-2.17.so)
              ls 26019 52414.329442:       2099        3407c0639c _dl_map_object_from_fd (/usr/lib64/ld-2.17.so)
              ls 26019 52414.330181:    1242100        34080917bb get_next_seq (/usr/lib64/libc-2.17.so)
              ls 26019 52414.330181:       3774        34080917bb get_next_seq (/usr/lib64/libc-2.17.so)
              ls 26019 52414.331427:    1083662  ffffffff810c7dc2 update_curr ([kernel.kallsyms])
              ls 26019 52414.331427:        360  ffffffff810c7dc2 update_curr ([kernel.kallsyms])

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: David Ahern <dsahern@gmail.com>
Cc: "Jen-Cheng(Tommy) Huang" <tommy24@gatech.edu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jen-Cheng(Tommy) Huang <tommy24@gatech.edu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1408977943-16594-9-git-send-email-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adding period as a default output column in script command fo hardware,
software and raw events.

If PERF_SAMPLE_PERIOD sample type is defined in perf.data, following
will be displayed in perf script output:

  $ perf script
              ls  8034 57477.887209:     250000 task-clock:  ffffffff81361d72 memset ([kernel.kallsyms])
              ls  8034 57477.887464:     250000 task-clock:  ffffffff816f6d92 _raw_spin_unlock_irqrestore ([kernel.kallsyms])
              ls  8034 57477.887708:     250000 task-clock:  ffffffff811a94f0 do_munmap ([kernel.kallsyms])
              ls  8034 57477.887959:     250000 task-clock:        34080916c6 get_next_seq (/usr/lib64/libc-2.17.so)
              ls  8034 57477.888208:     250000 task-clock:        3408079230 _IO_doallocbuf (/usr/lib64/libc-2.17.so)
              ls  8034 57477.888717:     250000 task-clock:  ffffffff814242c8 n_tty_write ([kernel.kallsyms])
              ls  8034 57477.889285:     250000 task-clock:        3408076402 fwrite_unlocked (/usr/lib64/libc-2.17.so)

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: "Jen-Cheng(Tommy) Huang" <tommy24@gatech.edu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jen-Cheng(Tommy) Huang <tommy24@gatech.edu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1408977943-16594-10-git-send-email-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ip_setup_cork() called inside ip_append_data() steals dst entry from rt to cork
and in case errors in __ip_append_data() nobody frees stolen dst entry

Fixes: 2e77d89 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
Signed-off-by: Vasily Averin <vvs@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull() called by arphdr_ok can change skb->data, so put the arp
setting after arphdr_ok to avoid the use the freed memory

Fixes: 0714812 ("openvswitch: Eliminate memset() from flow_extract.")
Cc: Jesse Gross <jesse@nicira.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull maybe change skb->data and make eth pointer oboslete,
so eth needs to reload

Fixes: 91269e3 ("vxlan: using pskb_may_pull as early as possible")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If megaflows are disabled, the userspace does not send the netlink attribute
OVS_FLOW_ATTR_MASK, and the kernel must create an exact match mask.

sw_flow_mask_set() sets every bytes (in 'range') of the mask to 0xff, even the
bytes that represent padding for struct sw_flow, or the bytes that represent
fields that may not be set during ovs_flow_extract().
This is a problem, because when we extract a flow from a packet,
we do not memset() anymore the struct sw_flow to 0.

This commit gets rid of sw_flow_mask_set() and introduces mask_set_nlattr(),
which operates on the netlink attributes rather than on the mask key. Using
this approach we are sure that only the bytes that the user provided in the
flow are matched.

Also, if the parse_flow_mask_nlattrs() for the mask ENCAP attribute fails, we
now return with an error.

This bug is introduced by commit 0714812
("openvswitch: Eliminate memset() from flow_extract").

Reported-by: Alex Wang <alexw@nicira.com>
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven French <smfrench@gmail.com>
In case that the IP header has optional field at the end, this patch will
get the port numbers after that field, and compute the hash. The general
parser skb_flow_dissect() is used here.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull() maybe change skb->data and make eth pointer oboslete,
so set eth after pskb_may_pull()

Fixes:3d7b46cd("ip_tunnel: push generic protocol handling to ip_tunnel module")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull() maybe change skb->data and make uh pointer oboslete,
so reload uh and guehdr

Fixes: 37dd024 ("gue: Receive side for Generic UDP Encapsulation")
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove calling cancel_delayed_work_sync() for runtime suspend,
because it would cause dead lock. Instead, return -EBUSY to
avoid the device enters suspending if the net is running and
the delayed work is pending or running. The delayed work would
try to wake up the device later, so the suspending is not
necessary.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't ring the doorbell, and don't do PIO.  This will also prevent
 TX Push, because there will be more than one buffer waiting when
 the doorbell is rung.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 971f10e ("tcp: better TCP_SKB_CB layout to reduce cache line
misses") added a regression for SO_BINDTODEVICE on IPv6.

This is because we still use inet6_iif() which expects that IP6 control
block is still at the beginning of skb->cb[]

This patch adds tcp_v6_iif() helper and uses it where necessary.

Because __inet6_lookup_skb() is used by TCP and DCCP, we add an iif
parameter to it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 971f10e ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit ec8a2e5 ("tipc: same receive
code path for connection protocol and data messages") we omitted the
the possiblilty that an arriving message extracted from a bundle buffer
may be a multicast message. Such messages need to be to be delivered to
the socket via a separate function, tipc_sk_mcast_rcv(). As a result,
small multicast messages arriving as members of a bundle buffer will be
silently dropped.

This commit corrects the error by considering this case in the function
tipc_link_bundle_rcv().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b4d2394 ("dsa: Replace mii_bus with a generic host device")
replaces mii_bus with a generic host_dev, and introduces
dsa_host_dev_to_mii_bus() to support conversion from host_dev to mii_bus.
However, in some cases it uses to_mii_bus to perform that conversion.
Since host_dev is not the phy bus device but typically a platform device,
this fails and results in a crash with the affected drivers.

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff81781d35>] __mutex_lock_slowpath+0x75/0x100
PGD 406783067 PUD 406784067 PMD 0
Oops: 0002 [#1] SMP
...
Call Trace:
[<ffffffff810a538b>] ? pick_next_task_fair+0x61b/0x880
[<ffffffff81781de3>] mutex_lock+0x23/0x37
[<ffffffff81533244>] mdiobus_read+0x34/0x60
[<ffffffff8153b95a>] __mv88e6xxx_reg_read+0x8a/0xa0
[<ffffffff8153b9bc>] mv88e6xxx_reg_read+0x4c/0xa0

Fixes: b4d2394 ("dsa: Replace mii_bus with a generic host device")
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ralfbaechle and others added 21 commits October 24, 2014 13:27
A platform driver for which nothing ever registers the corresponding
platform device.

Also it was driving the same hardware as sead3-i2c-drv.c so redundant
anyway and couldn't co-exist with that driver because each of them was
using a private spinlock to protect access to the same hardware
resources.

This also fixes a randconfig problem:

arch/mips/mti-sead3/sead3-pic32-i2c-drv.c: In function 'i2c_platform_probe':
arch/mips/mti-sead3/sead3-pic32-i2c-drv.c:345:2: error: implicit declaration of
function 'i2c_add_numbered_adapter' [-Werror=implicit-function-declaration]
  ret = i2c_add_numbered_adapter(&priv->adap);
    ^
arch/mips/mti-sead3/sead3-pic32-i2c-drv.c: In function
'i2c_platform_remove':
arch/mips/mti-sead3/sead3-pic32-i2c-drv.c:361:2: error: implicit declaration
of function 'i2c_del_adapter' [-Werror=implicit-function-declaration]
i2c_del_adapter(&priv->adap);

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
A failure to decode the instruction can cause a NULL pointer access.
This is fixed simply by moving the "done" label as close as possible
to the return.

This fixes CVE-2014-8481.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Cc: stable@vger.kernel.org
Fixes: 41061cd
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, all group15 instructions are decoded as clflush (e.g., mfence,
xsave).  In addition, the clflush instruction requires no prefix (66/f2/f3)
would exist. If prefix exists it may encode a different instruction (e.g.,
clflushopt).

Creating a group for clflush, and different group for each prefix.

This has been the case forever, but the next patch needs the cflush group
in order to fix a bug introduced in 3.17.

Fixes: 41061cd
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The decode phase of the x86 emulator assumes that every instruction with the
ModRM flag, and which can be used with RIP-relative addressing, has either
SrcMem or DstMem.  This is not the case for several instructions - prefetch,
hint-nop and clflush.

Adding SrcMem|NoAccess for prefetch and hint-nop and SrcMem for clflush.

This fixes CVE-2014-8480.

Fixes: 41061cd
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The third parameter of kvm_unpin_pages() when called from
kvm_iommu_map_pages() is wrong, it should be the number of pages to un-pin
and not the page size.

This error was facilitated with an inconsistent API: kvm_pin_pages() takes
a size, but kvn_unpin_pages() takes a number of pages, so fix the problem
by matching the two.

This was introduced by commit 350b8bd ("kvm: iommu: fix the third parameter
of kvm_iommu_put_pages (CVE-2014-3601)"), which fixes the lack of
un-pinning for pages intended to be un-pinned (i.e. memory leak) but
unfortunately potentially aggravated the number of pages we un-pin that
should have stayed pinned. As far as I understand though, the same
practical mitigations apply.

This issue was found during review of Red Hat 6.6 patches to prepare
Ksplice rebootless updates.

Thanks to Vegard for his time on a late Friday evening to help me in
understanding this code.

Fixes: 350b8bd ("kvm: iommu: fix the third parameter of... (CVE-2014-3601)")
Cc: stable@vger.kernel.org
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Reviewed-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Even after the recent fix, the assertion on paging_tmpl.h is triggered.
Apparently, the assertion wants to check that the PAE is always set on
long-mode, but does it in incorrect way.  Note that the assertion is not
enabled unless the code is debugged by defining MMU_DEBUG.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After commit 80ce163 (KVM: VFIO: register kvm_device_ops dynamically),
kvm_device_ops of vfio can be registered dynamically. Commit 3c3c29f
(kvm-vfio: do not use module_init) move the dynamic register invoked by
kvm_init in order to fix broke unloading of the kvm module. However,
kvm_device_ops of vfio is unregistered after rmmod kvm-intel module
which lead to device type collision detection warning after kvm-intel
module reinsmod.

    WARNING: CPU: 1 PID: 10358 at /root/cathy/kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:3289 kvm_init+0x234/0x282 [kvm]()
    Modules linked in: kvm_intel(O+) kvm(O) nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs fscache lockd sunrpc pci_stub bridge stp llc autofs4 8021q cpufreq_ondemand ipv6 joydev microcode pcspkr igb i2c_algo_bit ehci_pci ehci_hcd e1000e i2c_i801 ixgbe ptp pps_core hwmon mdio tpm_tis tpm ipmi_si ipmi_msghandler acpi_cpufreq isci libsas scsi_transport_sas button dm_mirror dm_region_hash dm_log dm_mod [last unloaded: kvm_intel]
    CPU: 1 PID: 10358 Comm: insmod Tainted: G        W  O   3.17.0-rc1 #2
    Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
     0000000000000cd9 ffff880ff08cfd18 ffffffff814a61d9 0000000000000cd9
     0000000000000000 ffff880ff08cfd58 ffffffff810417b7 ffff880ff08cfd48
     ffffffffa045bcac ffffffffa049c420 0000000000000040 00000000000000ff
    Call Trace:
     [<ffffffff814a61d9>] dump_stack+0x49/0x60
     [<ffffffff810417b7>] warn_slowpath_common+0x7c/0x96
     [<ffffffffa045bcac>] ? kvm_init+0x234/0x282 [kvm]
     [<ffffffff810417e6>] warn_slowpath_null+0x15/0x17
     [<ffffffffa045bcac>] kvm_init+0x234/0x282 [kvm]
     [<ffffffffa016e995>] vmx_init+0x1bf/0x42a [kvm_intel]
     [<ffffffffa016e7d6>] ? vmx_check_processor_compat+0x64/0x64 [kvm_intel]
     [<ffffffff810002ab>] do_one_initcall+0xe3/0x170
     [<ffffffff811168a9>] ? __vunmap+0xad/0xb8
     [<ffffffff8109c58f>] do_init_module+0x2b/0x174
     [<ffffffff8109d414>] load_module+0x43e/0x569
     [<ffffffff8109c6d8>] ? do_init_module+0x174/0x174
     [<ffffffff8109c75a>] ? copy_module_from_user+0x39/0x82
     [<ffffffff8109b7dd>] ? module_sect_show+0x20/0x20
     [<ffffffff8109d65f>] SyS_init_module+0x54/0x81
     [<ffffffff814a9a12>] system_call_fastpath+0x16/0x1b
    ---[ end trace 0626f4a3ddea56f3 ]---

The bug can be reproduced by:

    rmmod kvm_intel.ko
    insmod kvm_intel.ko

without rmmod/insmod kvm.ko
This patch fixes the bug by unregistering kvm_device_ops of vfio when the
kvm-intel module is removed.

Reported-by: Liu Rongrong <rongrongx.liu@intel.com>
Fixes: 3c3c29f
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This isn't a module and shouldn't be one.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
When user asks to turn off ASLR by writing "0" to
/proc/sys/kernel/randomize_va_space there should not be
any randomization to mmap base, stack, VDSO, libs, text and heap

Currently arm64 violates this behavior by randomising text.
Fix this by defining a constant ELF_ET_DYN_BASE. The randomisation of
mm->mmap_base is done by setup_new_exec -> arch_pick_mmap_layout ->
mmap_base -> mmap_rnd.

Signed-off-by: Arun Chandran <achandran@mvista.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
…g boot.

Meelis Roos reported that kernels built with gcc-4.9 do not boot, we
eventually narrowed this down to only impacting machines using
UltraSPARC-III and derivitive cpus.

The crash happens right when the first user process is spawned:

[   54.451346] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004
[   54.451346]
[   54.571516] CPU: 1 PID: 1 Comm: init Not tainted 3.16.0-rc2-00211-gd7933ab #96
[   54.666431] Call Trace:
[   54.698453]  [0000000000762f8c] panic+0xb0/0x224
[   54.759071]  [000000000045cf68] do_exit+0x948/0x960
[   54.823123]  [000000000042cbc0] fault_in_user_windows+0xe0/0x100
[   54.902036]  [0000000000404ad0] __handle_user_windows+0x0/0x10
[   54.978662] Press Stop-A (L1-A) to return to the boot prom
[   55.050713] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004

Further investigation showed that compiling only per_cpu_patch() with
an older compiler fixes the boot.

Detailed analysis showed that the function is not being miscompiled by
gcc-4.9, but it is using a different register allocation ordering.

With the gcc-4.9 compiled function, something during the code patching
causes some of the %i* input registers to get corrupted.  Perhaps
we have a TLB miss path into the firmware that is deep enough to
cause a register window spill and subsequent restore when we get
back from the TLB miss trap.

Let's plug this up by doing two things:

1) Stop using the firmware stack for client interface calls into
   the firmware.  Just use the kernel's stack.

2) As soon as we can, call into a new function "start_early_boot()"
   to put a one-register-window buffer between the firmware's
   deepest stack frame and the top-most initial kernel one.

Reported-by: Meelis Roos <mroos@linux.ee>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not sufficient to only implement get_user_pages_fast(), you
must also implement the atomic version __get_user_pages_fast()
otherwise you end up using the weak symbol fallback implementation
which simply returns zero.

This is dangerous, because it causes the futex code to loop forever
if transparent hugepages are supported (see get_futex_key()).

Signed-off-by: David S. Miller <davem@davemloft.net>
With 48-bit VA space, the 64K page configuration uses 3 levels instead
of 2 and PUD_SIZE != PMD_SIZE. Since with 64K pages we only cover
PMD_SIZE with the initial swapper_pg_dir populated in head.S, the
memblock current_limit needs to be set accordingly in map_mem() to avoid
allocating unmapped memory. The memblock current_limit is progressively
increased as more blocks are mapped.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
…rzhang/linux

Pull thermal management updates from Zhang Rui:
 "Sorry that I missed the merge window as there is a bug found in the
  last minute, and I have to fix it and wait for the code to be tested
  in linux-next tree for a few days.  Now the buggy patch has been
  dropped entirely from my next branch.  Thus I hope those changes can
  still be merged in 3.18-rc2 as most of them are platform thermal
  driver changes.

  Specifics:

   - introduce ACPI INT340X thermal drivers.

     Newer laptops and tablets may have thermal sensors and other
     devices with thermal control capabilities that are exposed for the
     OS to use via the ACPI INT340x device objects.  Several drivers are
     introduced to expose the temperature information and cooling
     ability from these objects to user-space via the normal thermal
     framework.

     From: Lu Aaron, Lan Tianyu, Jacob Pan and Zhang Rui.

   - introduce a new thermal governor, which just uses a hysteresis to
     switch abruptly on/off a cooling device.  This governor can be used
     to control certain fan devices that can not be throttled but just
     switched on or off.  From: Peter Feuerer.

   - introduce support for some new thermal interrupt functions on
     i.MX6SX, in IMX thermal driver.  From: Anson, Huang.

   - introduce tracing support on thermal framework.  From: Punit
     Agrawal.

   - small fixes in OF thermal and thermal step_wise governor"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (25 commits)
  Thermal: int340x thermal: select ACPI fan driver
  Thermal: int3400_thermal: use acpi_thermal_rel parsing APIs
  Thermal: int340x_thermal: expose acpi thermal relationship tables
  Thermal: introduce int3403 thermal driver
  Thermal: introduce INT3402 thermal driver
  Thermal: move the KELVIN_TO_MILLICELSIUS macro to thermal.h
  ACPI / Fan: support INT3404 thermal device
  ACPI / Fan: add ACPI 4.0 style fan support
  ACPI / fan: convert to platform driver
  ACPI / fan: use acpi_device_xxx_power instead of acpi_bus equivelant
  ACPI / fan: remove no need check for device pointer
  ACPI / fan: remove unused macro
  Thermal: int3400 thermal: register to thermal framework
  Thermal: int3400 thermal: add capability to detect supporting UUIDs
  Thermal: introduce int3400 thermal driver
  ACPI: add ACPI_TYPE_LOCAL_REFERENCE support to acpi_extract_package()
  ACPI: make acpi_create_platform_device() an external API
  thermal: step_wise: fix: Prevent from binary overflow when trend is dropping
  ACPI: introduce ACPI int340x thermal scan handler
  thermal: Added Bang-bang thermal governor
  ...
…rnel/git/rafael/linux-pm

Pull ACPI and power management updates from Rafael Wysocki:
 "This is material that didn't make it to my 3.18-rc1 pull request for
  various reasons, mostly related to timing and travel (LinuxCon EU /
  LPC) plus a couple of fixes for recent bugs.

  The only really new thing here is the PM QoS class for memory
  bandwidth, but it is simple enough and users of it will be added in
  the next cycle.  One major change in behavior is that platform devices
  enumerated by ACPI will use 32-bit DMA mask by default.  Also included
  is an ACPICA update to a new upstream release, but that's mostly
  cleanups, changes in tools and similar.  The rest is fixes and
  cleanups mostly.

  Specifics:

   - Fix for a recent PCI power management change that overlooked the
     fact that some IRQ chips might not be able to configure PCIe PME
     for system wakeup from Lucas Stach.

   - Fix for a bug introduced in 3.17 where acpi_device_wakeup() is
     called with a wrong ordering of arguments from Zhang Rui.

   - A bunch of intel_pstate driver fixes (all -stable candidates) from
     Dirk Brandewie, Gabriele Mazzotta and Pali Rohár.

   - Fixes for a rather long-standing problem with the OOM killer and
     the freezer that frozen processes killed by the OOM do not actually
     release any memory until they are thawed, so OOM-killing them is
     rather pointless, with a couple of cleanups on top (Michal Hocko,
     Cong Wang, Rafael J Wysocki).

   - ACPICA update to upstream release 20140926, inlcuding mostly
     cleanups reducing differences between the upstream ACPICA and the
     kernel code, tools changes (acpidump, acpiexec) and support for the
     _DDN object (Bob Moore, Lv Zheng).

   - New PM QoS class for memory bandwidth from Tomeu Vizoso.

   - Default 32-bit DMA mask for platform devices enumerated by ACPI
     (this change is mostly needed for some drivers development in
     progress targeted at 3.19) from Heikki Krogerus.

   - ACPI EC driver cleanups, mostly related to debugging, from Lv
     Zheng.

   - cpufreq-dt driver updates from Thomas Petazzoni.

   - powernv cpuidle driver update from Preeti U Murthy"

* tag 'pm+acpi-3.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (34 commits)
  intel_pstate: Correct BYT VID values.
  intel_pstate: Fix BYT frequency reporting
  intel_pstate: Don't lose sysfs settings during cpu offline
  cpufreq: intel_pstate: Reflect current no_turbo state correctly
  cpufreq: expose scaling_cur_freq sysfs file for set_policy() drivers
  cpufreq: intel_pstate: Fix setting max_perf_pct in performance policy
  PCI / PM: handle failure to enable wakeup on PCIe PME
  ACPI: invoke acpi_device_wakeup() with correct parameters
  PM / freezer: Clean up code after recent fixes
  PM: convert do_each_thread to for_each_process_thread
  OOM, PM: OOM killed task shouldn't escape PM suspend
  freezer: remove obsolete comments in __thaw_task()
  freezer: Do not freeze tasks killed by OOM killer
  ACPI / platform: provide default DMA mask
  cpuidle: powernv: Populate cpuidle state details by querying the device-tree
  cpufreq: cpufreq-dt: adjust message related to regulators
  cpufreq: cpufreq-dt: extend with platform_data
  cpufreq: allow driver-specific data
  ACPI / EC: Cleanup coding style.
  ACPI / EC: Refine event/query debugging messages.
  ...
…rnel/git/tytso/random

Pull /dev/random updates from Ted Ts'o:
 "This adds a memzero_explicit() call which is guaranteed not to be
  optimized away by GCC.  This is important when we are wiping
  cryptographically sensitive material"

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
  crypto: memzero_explicit - make sure to clear out sensitive data
  random: add and use memzero_explicit() for clearing data
…el/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "Here are a chunk of small fixes since rc1: two PCM core fixes, one is
  a long-standing annoyance about lockdep and another is an ARM64 mmap
  fix.

  The rest are a HD-audio HDMI hotplug notification fix, a fix for
  missing NULL termination in Realtek codec quirks and a few new
  device/codec-specific quirks as usual"

* tag 'sound-3.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda - Add missing terminating entry to SND_HDA_PIN_QUIRK macro
  ALSA: pcm: Fix false lockdep warnings
  ALSA: hda - Fix inverted LED gpio setup for Lenovo Ideapad
  ALSA: hda - hdmi: Fix missing ELD change event on plug/unplug
  ALSA: usb-audio: Add support for Steinberg UR22 USB interface
  ALSA: ALC283 codec - Avoid pop noise on headphones during suspend/resume
  ALSA: pcm: use the same dma mmap codepath both for arm and arm64
…ub/scm/linux/kernel/git/xen/tip

Pull xen bug fixes from David Vrabel:

 - Fix regression in xen_clocksource_read() which caused all Xen guests
   to crash early in boot.
 - Several fixes for super rare race conditions in the p2m.
 - Assorted other minor fixes.

* tag 'stable/for-linus-3.18-b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/pci: Allocate memory for physdev_pci_device_add's optarr
  x86/xen: panic on bad Xen-provided memory map
  x86/xen: Fix incorrect per_cpu accessor in xen_clocksource_read()
  x86/xen: avoid race in p2m handling
  x86/xen: delay construction of mfn_list_list
  x86/xen: avoid writing to freed memory after race in p2m handling
  xen/balloon: Don't continue ballooning when BP_ECANCELED is encountered
Pull kvm fixes from Paolo Bonzini:
 "This is a pretty large update.  I think it is roughly as big as what I
  usually had for the _whole_ rc period.

  There are a few bad bugs where the guest can OOPS or crash the host.
  We have also started looking at attack models for nested
  virtualization; bugs that usually result in the guest ring 0 crashing
  itself become more worrisome if you have nested virtualization,
  because the nested guest might bring down the non-nested guest as
  well.  For current uses of nested virtualization these do not really
  have a security impact, but you never know and bugs are bugs
  nevertheless.

  A lot of these bugs are in 3.17 too, resulting in a large number of
  stable@ Ccs.  I checked that all the patches apply there with no
  conflicts"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: vfio: fix unregister kvm_device_ops of vfio
  KVM: x86: Wrong assertion on paging_tmpl.h
  kvm: fix excessive pages un-pinning in kvm_iommu_map error path.
  KVM: x86: PREFETCH and HINT_NOP should have SrcMem flag
  KVM: x86: Emulator does not decode clflush well
  KVM: emulate: avoid accessing NULL ctxt->memopp
  KVM: x86: Decoding guest instructions which cross page boundary may fail
  kvm: x86: don't kill guest on unknown exit reason
  kvm: vmx: handle invvpid vm exit gracefully
  KVM: x86: Handle errors when RIP is set during far jumps
  KVM: x86: Emulator fixes for eip canonical checks on near branches
  KVM: x86: Fix wrong masking on relative jump/call
  KVM: x86: Improve thread safety in pit
  KVM: x86: Prevent host from panicking on shared MSR writes.
  KVM: x86: Check non-canonical addresses upon WRMSR
Pull two sparc fixes from David Miller:

 1) Fix boots with gcc-4.9 compiled sparc64 kernels.

 2) Add missing __get_user_pages_fast() on sparc64 to fix hangs on
    futexes used in transparent hugepage areas.

    It's really idiotic to have a weak symbolled fallback that just
    returns zero, and causes this kind of bug.  There should be no
    backup implementation and the link should fail if the architecture
    fails to provide __get_user_pages_fast() and supports transparent
    hugepages.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Implement __get_user_pages_fast().
  sparc64: Fix register corruption in top-most kernel stack frame during boot.
…git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

 - enable 48-bit VA space now that KVM has been fixed, together with a
   couple of fixes for pgd allocation alignment and initial memblock
   current_limit.  There is still a dependency on !ARM_SMMU which needs
   to be updated as it uses the page table manipulation macros of the
   host kernel
 - eBPF fixes following changes/conflicts during the merging window
 - Compat types affecting compat_elf_prpsinfo
 - Compilation error on UP builds
 - ASLR fix when /proc/sys/kernel/randomize_va_space == 0
 - DT definitions for CLCD support on ARMv8 model platform

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: Fix memblock current_limit with 64K pages and 48-bit VA
  arm64: ASLR: Don't randomise text when randomise_va_space == 0
  arm64: vexpress: Add CLCD support to the ARMv8 model platform
  arm64: Fix compilation error on UP builds
  Documentation/arm64/memory.txt: fix typo
  net: bpf: arm64: minor fix of type in jited
  arm64: bpf: add 'load 64-bit immediate' instruction
  arm64: bpf: add 'shift by register' instructions
  net: bpf: arm64: address randomize and write protect JIT code
  arm64: mm: Correct fixmap pagetable types
  arm64: compat: fix compat types affecting struct compat_elf_prpsinfo
  arm64: Align less than PAGE_SIZE pgds naturally
  arm64: Allow 48-bits VA space without ARM_SMMU
…ream-linus

Pull MIPS fixes from Ralf Baechle:
 "This is the first round of fixes and tying up loose ends for MIPS.

   - plenty of fixes for build errors in specific obscure configurations
   - remove redundant code on the Lantiq platform
   - removal of a useless SEAD I2C driver that was causing a build issue
   - fix an earlier TLB exeption handler fix to also work on Octeon.
   - fix ISA level dependencies in FPU emulator's instruction decoding.
   - don't hardcode kernel command line in Octeon software emulator.
   - fix an earlier fix for the Loondson 2 clock setting"

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
  MIPS: SEAD3: Fix I2C device registration.
  MIPS: SEAD3: Nuke PIC32 I2C driver.
  MIPS: ftrace: Fix a microMIPS build problem
  MIPS: MSP71xx: Fix build error
  MIPS: Malta: Do not build the malta-amon.c file if CMP is not enabled
  MIPS: Prevent compiler warning from cop2_{save,restore}
  MIPS: Kconfig: Add missing MIPS_CPS dependencies to PM and cpuidle
  MIPS: idle: Remove leftover __pastwait symbol and its references
  MIPS: Sibyte: Include the swarm subdir to the sb1250 LittleSur builds
  MIPS: ptrace.h: Add a missing include
  MIPS: ath79: Fix compilation error when CONFIG_PCI is disabled
  MIPS: MSP71xx: Remove compilation error when CONFIG_MIPS_MT is present
  MIPS: Octeon: Remove special case for simulator command line.
  MIPS: tlbex: Properly fix HUGE TLB Refill exception handler
  MIPS: loongson2_cpufreq: Fix CPU clock rate setting mismerge
  pci: pci-lantiq: remove duplicate check on resource
  MIPS: Lasat: Add missing CONFIG_PROC_FS dependency to PICVUE_PROC
  MIPS: cp1emu: Fix ISA restrictions for cop1x_op instructions
wutiejun pushed a commit that referenced this pull request Oct 25, 2014
同步torvalds的内核修改。
@wutiejun wutiejun merged commit 5a0e3eb into wutiejun:master Oct 25, 2014
wutiejun pushed a commit that referenced this pull request Oct 25, 2014
cat /sys/.../pools followed by removal the device leads to:

|======================================================
|[ INFO: possible circular locking dependency detected ]
|3.17.0-rc4+ #1498 Not tainted
|-------------------------------------------------------
|rmmod/2505 is trying to acquire lock:
| (s_active#28){++++.+}, at: [<c017f754>] kernfs_remove_by_name_ns+0x3c/0x88
|
|but task is already holding lock:
| (pools_lock){+.+.+.}, at: [<c011494c>] dma_pool_destroy+0x18/0x17c
|
|which lock already depends on the new lock.
|the existing dependency chain (in reverse order) is:
|
|-> #1 (pools_lock){+.+.+.}:
|   [<c0114ae8>] show_pools+0x30/0xf8
|   [<c0313210>] dev_attr_show+0x1c/0x48
|   [<c0180e84>] sysfs_kf_seq_show+0x88/0x10c
|   [<c017f960>] kernfs_seq_show+0x24/0x28
|   [<c013efc4>] seq_read+0x1b8/0x480
|   [<c011e820>] vfs_read+0x8c/0x148
|   [<c011ea10>] SyS_read+0x40/0x8c
|   [<c000e960>] ret_fast_syscall+0x0/0x48
|
|-> #0 (s_active#28){++++.+}:
|   [<c017e9ac>] __kernfs_remove+0x258/0x2ec
|   [<c017f754>] kernfs_remove_by_name_ns+0x3c/0x88
|   [<c0114a7c>] dma_pool_destroy+0x148/0x17c
|   [<c03ad288>] hcd_buffer_destroy+0x20/0x34
|   [<c03a4780>] usb_remove_hcd+0x110/0x1a4

The problem is the lock order of pools_lock and kernfs_mutex in
dma_pool_destroy() vs show_pools() call path.

This patch breaks out the creation of the sysfs file outside of the
pools_lock mutex.  The newly added pools_reg_lock ensures that there is no
race of create vs destroy code path in terms whether or not the sysfs file
has to be deleted (and was it deleted before we try to create a new one)
and what to do if device_create_file() failed.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wutiejun pushed a commit that referenced this pull request Oct 25, 2014
…_map_entry

By the following commits, we prevented from allocating firmware_map_entry
of same memory range:
  f0093ed: drivers/firmware/memmap.c: don't allocate firmware_map_entry
            of same memory range
  49c8b24: drivers/firmware/memmap.c: pass the correct argument to
            firmware_map_find_entry_bootmem()

But it's not enough. When PNP0C80 device is added by acpi_scan_init(),
memmap sysfses of same firmware_map_entry are created twice as follows:

  # cat /sys/firmware/memmap/*/start
  0x40000000000
  0x60000000000
  0x4a837000
  0x4a83a000
  0x4a8b5000
  ...
  0x40000000000
  0x60000000000
  ...

The flows of the issues are as follows:

  1. e820_reserve_resources() allocates firmware_map_entrys of all
     memory ranges defined in e820. And, these firmware_map_entrys
     are linked with map_entries list.

     map_entries -> entry 1 -> ... -> entry N

  2. When PNP0C80 device is limited by mem= boot option, acpi_scan_init()
     added the memory device. In this case, firmware_map_add_hotplug()
     allocates firmware_map_entry and creates memmap sysfs.

     map_entries -> entry 1 -> ... -> entry N -> entry N+1
                                                 |
                                                 memmap 1

  3. firmware_memmap_init() creates memmap sysfses of firmware_map_entrys
     linked with map_entries.

     map_entries -> entry 1 -> ... -> entry N -> entry N+1
                     |                 |             |
                     memmap 2          memmap N+1    memmap 1
                                                     memmap N+2

So while hot removing the PNP0C80 device, kernel panic occurs as follows:

     BUG: unable to handle kernel paging request at 00000001003e000b
      IP: sysfs_open_file+0x46/0x2b0
      PGD 203a89fe067 PUD 0
      Oops: 0000 [#1] SMP
      ...
      Call Trace:
        do_dentry_open+0x1ef/0x2a0
        finish_open+0x31/0x40
        do_last+0x57c/0x1220
        path_openat+0xc2/0x4c0
        do_filp_open+0x4b/0xb0
        do_sys_open+0xf3/0x1f0
        SyS_open+0x1e/0x20
        system_call_fastpath+0x16/0x1b

The patch adds a check of confirming whether memmap sysfs of
firmware_map_entry has been created, and does not create memmap
sysfs of same firmware_map_entry.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wutiejun pushed a commit that referenced this pull request Oct 25, 2014
…ap()

Error report likely result in IO so it is bad idea to do it from
atomic context.

This patch should fix following issue:

BUG: sleeping function called from invalid context at include/linux/buffer_head.h:349
in_atomic(): 1, irqs_disabled(): 0, pid: 137, name: kworker/u128:1
5 locks held by kworker/u128:1/137:
 #0:  ("writeback"){......}, at: [<ffffffff81085618>] process_one_work+0x228/0x4d0
 #1:  ((&(&wb->dwork)->work)){......}, at: [<ffffffff81085618>] process_one_work+0x228/0x4d0
 #2:  (jbd2_handle){......}, at: [<ffffffff81242622>] start_this_handle+0x712/0x7b0
 #3:  (&ei->i_data_sem){......}, at: [<ffffffff811fa387>] ext4_map_blocks+0x297/0x430
 #4:  (&(&bgl->locks[i].lock)->rlock){......}, at: [<ffffffff811f3180>] ext4_read_block_bitmap_nowait+0x5d0/0x630
CPU: 3 PID: 137 Comm: kworker/u128:1 Not tainted 3.17.0-rc2-00184-g82752e4 torvalds#165
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
Workqueue: writeback bdi_writeback_workfn (flush-1:0)
 0000000000000411 ffff880813777288 ffffffff815c7fdc ffff880813777288
 ffff880813a8bba0 ffff8808137772a8 ffffffff8108fb30 ffff880803e01e38
 ffff880803e01e38 ffff8808137772c8 ffffffff811a8d53 ffff88080ecc6000
Call Trace:
 [<ffffffff815c7fdc>] dump_stack+0x51/0x6d
 [<ffffffff8108fb30>] __might_sleep+0xf0/0x100
 [<ffffffff811a8d53>] __sync_dirty_buffer+0x43/0xe0
 [<ffffffff811a8e03>] sync_dirty_buffer+0x13/0x20
 [<ffffffff8120f581>] ext4_commit_super+0x1d1/0x230
 [<ffffffff8120fa03>] save_error_info+0x23/0x30
 [<ffffffff8120fd06>] __ext4_error+0xb6/0xd0
 [<ffffffff8120f260>] ? ext4_group_desc_csum+0x140/0x190
 [<ffffffff811f2d8c>] ext4_read_block_bitmap_nowait+0x1dc/0x630
 [<ffffffff8122e23a>] ext4_mb_init_cache+0x21a/0x8f0
 [<ffffffff8113ae95>] ? lru_cache_add+0x55/0x60
 [<ffffffff8112e16c>] ? add_to_page_cache_lru+0x6c/0x80
 [<ffffffff8122eaa0>] ext4_mb_init_group+0x190/0x280
 [<ffffffff8122ec51>] ext4_mb_good_group+0xc1/0x190
 [<ffffffff8123309a>] ext4_mb_regular_allocator+0x17a/0x410
 [<ffffffff8122c821>] ? ext4_mb_use_preallocated+0x31/0x380
 [<ffffffff81233535>] ? ext4_mb_new_blocks+0x205/0x8e0
 [<ffffffff8116ed5c>] ? kmem_cache_alloc+0xfc/0x180
 [<ffffffff812335b0>] ext4_mb_new_blocks+0x280/0x8e0
 [<ffffffff8116f2c4>] ? __kmalloc+0x144/0x1c0
 [<ffffffff81221797>] ? ext4_find_extent+0x97/0x320
 [<ffffffff812257f4>] ext4_ext_map_blocks+0xbc4/0x1050
 [<ffffffff811fa387>] ? ext4_map_blocks+0x297/0x430
 [<ffffffff811fa3ab>] ext4_map_blocks+0x2bb/0x430
 [<ffffffff81200e43>] ? ext4_init_io_end+0x23/0x50
 [<ffffffff811feb44>] ext4_writepages+0x564/0xaf0
 [<ffffffff815cde3b>] ? _raw_spin_unlock+0x2b/0x40
 [<ffffffff810ac7bd>] ? lock_release_non_nested+0x2fd/0x3c0
 [<ffffffff811a009e>] ? writeback_sb_inodes+0x10e/0x490
 [<ffffffff811a009e>] ? writeback_sb_inodes+0x10e/0x490
 [<ffffffff811377e3>] do_writepages+0x23/0x40
 [<ffffffff8119c8ce>] __writeback_single_inode+0x9e/0x280
 [<ffffffff811a026b>] writeback_sb_inodes+0x2db/0x490
 [<ffffffff811a0664>] wb_writeback+0x174/0x2d0
 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
 [<ffffffff811a0863>] wb_do_writeback+0xa3/0x200
 [<ffffffff811a0a40>] bdi_writeback_workfn+0x80/0x230
 [<ffffffff81085618>] ? process_one_work+0x228/0x4d0
 [<ffffffff810856cd>] process_one_work+0x2dd/0x4d0
 [<ffffffff81085618>] ? process_one_work+0x228/0x4d0
 [<ffffffff81085c1d>] worker_thread+0x35d/0x460
 [<ffffffff810858c0>] ? process_one_work+0x4d0/0x4d0
 [<ffffffff810858c0>] ? process_one_work+0x4d0/0x4d0
 [<ffffffff8108a885>] kthread+0xf5/0x100
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff8108a790>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff815ce2ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff8108a790>] ? __init_kthread_work

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
wutiejun pushed a commit that referenced this pull request Oct 25, 2014
…rder detected

remove spinlock in cpdma_desc_pool_destroy() as there is no active cpdma
channel and iounmap should be called without auquiring lock.

root@dra7xx-evm:~# modprobe -r ti_cpsw
[   50.539743]
[   50.541312] ======================================================
[   50.547796] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[   50.554826] 3.14.19-02124-g95c5b7b torvalds#308 Not tainted
[   50.559939] ------------------------------------------------------
[   50.566416] modprobe/1921 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[   50.573347]  (vmap_area_lock){+.+...}, at: [<c01127fc>] find_vmap_area+0x10/0x6c
[   50.581132]
[   50.581132] and this task is already holding:
[   50.587249]  (&(&pool->lock)->rlock#2){..-...}, at: [<bf017c74>] cpdma_ctlr_destroy+0x5c/0x114 [davinci_cpdma]
[   50.597766] which would create a new lock dependency:
[   50.603048]  (&(&pool->lock)->rlock#2){..-...} -> (vmap_area_lock){+.+...}
[   50.610296]
[   50.610296] but this new dependency connects a SOFTIRQ-irq-safe lock:
[   50.618601]  (&(&pool->lock)->rlock#2){..-...}
... which became SOFTIRQ-irq-safe at:
[   50.626829]   [<c06585a4>] _raw_spin_lock_irqsave+0x38/0x4c
[   50.632677]   [<bf01773c>] cpdma_desc_free.constprop.7+0x28/0x58 [davinci_cpdma]
[   50.640437]   [<bf0177e8>] __cpdma_chan_free+0x7c/0xa8 [davinci_cpdma]
[   50.647289]   [<bf017908>] __cpdma_chan_process+0xf4/0x134 [davinci_cpdma]
[   50.654512]   [<bf017984>] cpdma_chan_process+0x3c/0x54 [davinci_cpdma]
[   50.661455]   [<bf0277e8>] cpsw_poll+0x14/0xa8 [ti_cpsw]
[   50.667038]   [<c05844f4>] net_rx_action+0xc0/0x1e8
[   50.672150]   [<c0048234>] __do_softirq+0xcc/0x304
[   50.677183]   [<c004873c>] irq_exit+0xa8/0xfc
[   50.681751]   [<c000eeac>] handle_IRQ+0x50/0xb0
[   50.686513]   [<c0008638>] gic_handle_irq+0x28/0x5c
[   50.691628]   [<c06590a4>] __irq_svc+0x44/0x5c
[   50.696289]   [<c0658ab4>] _raw_spin_unlock_irqrestore+0x34/0x44
[   50.702591]   [<c065a9c4>] do_page_fault.part.9+0x144/0x3c4
[   50.708433]   [<c065acb8>] do_page_fault+0x74/0x84
[   50.713453]   [<c00083dc>] do_DataAbort+0x34/0x98
[   50.718391]   [<c065923c>] __dabt_usr+0x3c/0x40
[   50.723148]
[   50.723148] to a SOFTIRQ-irq-unsafe lock:
[   50.728893]  (vmap_area_lock){+.+...}
... which became SOFTIRQ-irq-unsafe at:
[   50.736476] ...  [<c06584e8>] _raw_spin_lock+0x28/0x38
[   50.741876]   [<c011376c>] alloc_vmap_area.isra.28+0xb8/0x300
[   50.747908]   [<c0113a44>] __get_vm_area_node.isra.29+0x90/0x134
[   50.754210]   [<c011486c>] get_vm_area_caller+0x3c/0x48
[   50.759692]   [<c0114be0>] vmap+0x40/0x78
[   50.763900]   [<c09442f0>] check_writebuffer_bugs+0x54/0x1a0
[   50.769835]   [<c093eac0>] start_kernel+0x320/0x388
[   50.774952]   [<80008074>] 0x80008074
[   50.778793]
[   50.778793] other info that might help us debug this:
[   50.778793]
[   50.787181]  Possible interrupt unsafe locking scenario:
[   50.787181]
[   50.794295]        CPU0                    CPU1
[   50.799042]        ----                    ----
[   50.803785]   lock(vmap_area_lock);
[   50.807446]                                local_irq_disable();
[   50.813652]                                lock(&(&pool->lock)->rlock#2);
[   50.820782]                                lock(vmap_area_lock);
[   50.827086]   <Interrupt>
[   50.829823]     lock(&(&pool->lock)->rlock#2);
[   50.834490]
[   50.834490]  *** DEADLOCK ***
[   50.834490]
[   50.840695] 4 locks held by modprobe/1921:
[   50.844981]  #0:  (&__lockdep_no_validate__){......}, at: [<c03e53e8>] driver_detach+0x44/0xb8
[   50.854038]  #1:  (&__lockdep_no_validate__){......}, at: [<c03e53f4>] driver_detach+0x50/0xb8
[   50.863102]  #2:  (&(&ctlr->lock)->rlock){......}, at: [<bf017c34>] cpdma_ctlr_destroy+0x1c/0x114 [davinci_cpdma]
[   50.873890]  #3:  (&(&pool->lock)->rlock#2){..-...}, at: [<bf017c74>] cpdma_ctlr_destroy+0x5c/0x114 [davinci_cpdma]
[   50.884871]
the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[   50.892827] -> (&(&pool->lock)->rlock#2){..-...} ops: 167 {
[   50.898703]    IN-SOFTIRQ-W at:
[   50.901995]                     [<c06585a4>] _raw_spin_lock_irqsave+0x38/0x4c
[   50.909476]                     [<bf01773c>] cpdma_desc_free.constprop.7+0x28/0x58 [davinci_cpdma]
[   50.918878]                     [<bf0177e8>] __cpdma_chan_free+0x7c/0xa8 [davinci_cpdma]
[   50.927366]                     [<bf017908>] __cpdma_chan_process+0xf4/0x134 [davinci_cpdma]
[   50.936218]                     [<bf017984>] cpdma_chan_process+0x3c/0x54 [davinci_cpdma]
[   50.944794]                     [<bf0277e8>] cpsw_poll+0x14/0xa8 [ti_cpsw]
[   50.952009]                     [<c05844f4>] net_rx_action+0xc0/0x1e8
[   50.958765]                     [<c0048234>] __do_softirq+0xcc/0x304
[   50.965432]                     [<c004873c>] irq_exit+0xa8/0xfc
[   50.971635]                     [<c000eeac>] handle_IRQ+0x50/0xb0
[   50.978035]                     [<c0008638>] gic_handle_irq+0x28/0x5c
[   50.984788]                     [<c06590a4>] __irq_svc+0x44/0x5c
[   50.991085]                     [<c0658ab4>] _raw_spin_unlock_irqrestore+0x34/0x44
[   50.999023]                     [<c065a9c4>] do_page_fault.part.9+0x144/0x3c4
[   51.006510]                     [<c065acb8>] do_page_fault+0x74/0x84
[   51.013171]                     [<c00083dc>] do_DataAbort+0x34/0x98
[   51.019738]                     [<c065923c>] __dabt_usr+0x3c/0x40
[   51.026129]    INITIAL USE at:
[   51.029335]                    [<c06585a4>] _raw_spin_lock_irqsave+0x38/0x4c
[   51.036729]                    [<bf017d78>] cpdma_chan_submit+0x4c/0x2f0 [davinci_cpdma]
[   51.045225]                    [<bf02863c>] cpsw_ndo_open+0x378/0x6bc [ti_cpsw]
[   51.052897]                    [<c058747c>] __dev_open+0x9c/0x104
[   51.059287]                    [<c05876ec>] __dev_change_flags+0x88/0x160
[   51.066420]                    [<c05877e4>] dev_change_flags+0x18/0x48
[   51.073270]                    [<c05ed51c>] devinet_ioctl+0x61c/0x6e0
[   51.080029]                    [<c056ee54>] sock_ioctl+0x5c/0x298
[   51.086418]                    [<c01350a4>] do_vfs_ioctl+0x78/0x61c
[   51.092993]                    [<c01356ac>] SyS_ioctl+0x64/0x74
[   51.099200]                    [<c000e580>] ret_fast_syscall+0x0/0x48
[   51.105956]  }
[   51.107696]  ... key      at: [<bf019000>] __key.21312+0x0/0xfffff650 [davinci_cpdma]
[   51.115912]  ... acquired at:
[   51.119019]    [<c00899ac>] lock_acquire+0x9c/0x104
[   51.124138]    [<c06584e8>] _raw_spin_lock+0x28/0x38
[   51.129341]    [<c01127fc>] find_vmap_area+0x10/0x6c
[   51.134547]    [<c0114960>] remove_vm_area+0x8/0x6c
[   51.139659]    [<c0114a7c>] __vunmap+0x20/0xf8
[   51.144318]    [<c001c350>] __arm_iounmap+0x10/0x18
[   51.149440]    [<bf017d08>] cpdma_ctlr_destroy+0xf0/0x114 [davinci_cpdma]
[   51.156560]    [<bf026294>] cpsw_remove+0x48/0x8c [ti_cpsw]
[   51.162407]    [<c03e62c8>] platform_drv_remove+0x18/0x1c
[   51.168063]    [<c03e4c44>] __device_release_driver+0x70/0xc8
[   51.174094]    [<c03e5458>] driver_detach+0xb4/0xb8
[   51.179212]    [<c03e4a6c>] bus_remove_driver+0x4c/0x90
[   51.184693]    [<c00b024c>] SyS_delete_module+0x10c/0x198
[   51.190355]    [<c000e580>] ret_fast_syscall+0x0/0x48
[   51.195661]
[   51.197217]
the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[   51.205986] -> (vmap_area_lock){+.+...} ops: 520 {
[   51.211032]    HARDIRQ-ON-W at:
[   51.214321]                     [<c06584e8>] _raw_spin_lock+0x28/0x38
[   51.221090]                     [<c011376c>] alloc_vmap_area.isra.28+0xb8/0x300
[   51.228750]                     [<c0113a44>] __get_vm_area_node.isra.29+0x90/0x134
[   51.236690]                     [<c011486c>] get_vm_area_caller+0x3c/0x48
[   51.243811]                     [<c0114be0>] vmap+0x40/0x78
[   51.249654]                     [<c09442f0>] check_writebuffer_bugs+0x54/0x1a0
[   51.257239]                     [<c093eac0>] start_kernel+0x320/0x388
[   51.263994]                     [<80008074>] 0x80008074
[   51.269474]    SOFTIRQ-ON-W at:
[   51.272769]                     [<c06584e8>] _raw_spin_lock+0x28/0x38
[   51.279525]                     [<c011376c>] alloc_vmap_area.isra.28+0xb8/0x300
[   51.287190]                     [<c0113a44>] __get_vm_area_node.isra.29+0x90/0x134
[   51.295126]                     [<c011486c>] get_vm_area_caller+0x3c/0x48
[   51.302245]                     [<c0114be0>] vmap+0x40/0x78
[   51.308094]                     [<c09442f0>] check_writebuffer_bugs+0x54/0x1a0
[   51.315669]                     [<c093eac0>] start_kernel+0x320/0x388
[   51.322423]                     [<80008074>] 0x80008074
[   51.327906]    INITIAL USE at:
[   51.331112]                    [<c06584e8>] _raw_spin_lock+0x28/0x38
[   51.337775]                    [<c011376c>] alloc_vmap_area.isra.28+0xb8/0x300
[   51.345352]                    [<c0113a44>] __get_vm_area_node.isra.29+0x90/0x134
[   51.353197]                    [<c011486c>] get_vm_area_caller+0x3c/0x48
[   51.360224]                    [<c0114be0>] vmap+0x40/0x78
[   51.365977]                    [<c09442f0>] check_writebuffer_bugs+0x54/0x1a0
[   51.373464]                    [<c093eac0>] start_kernel+0x320/0x388
[   51.380131]                    [<80008074>] 0x80008074
[   51.385517]  }
[   51.387260]  ... key      at: [<c0a66948>] vmap_area_lock+0x10/0x20
[   51.393841]  ... acquired at:
[   51.396945]    [<c00899ac>] lock_acquire+0x9c/0x104
[   51.402060]    [<c06584e8>] _raw_spin_lock+0x28/0x38
[   51.407266]    [<c01127fc>] find_vmap_area+0x10/0x6c
[   51.412478]    [<c0114960>] remove_vm_area+0x8/0x6c
[   51.417592]    [<c0114a7c>] __vunmap+0x20/0xf8
[   51.422252]    [<c001c350>] __arm_iounmap+0x10/0x18
[   51.427369]    [<bf017d08>] cpdma_ctlr_destroy+0xf0/0x114 [davinci_cpdma]
[   51.434487]    [<bf026294>] cpsw_remove+0x48/0x8c [ti_cpsw]
[   51.440336]    [<c03e62c8>] platform_drv_remove+0x18/0x1c
[   51.446000]    [<c03e4c44>] __device_release_driver+0x70/0xc8
[   51.452031]    [<c03e5458>] driver_detach+0xb4/0xb8
[   51.457147]    [<c03e4a6c>] bus_remove_driver+0x4c/0x90
[   51.462628]    [<c00b024c>] SyS_delete_module+0x10c/0x198
[   51.468289]    [<c000e580>] ret_fast_syscall+0x0/0x48
[   51.473584]
[   51.475140]
[   51.475140] stack backtrace:
[   51.479703] CPU: 0 PID: 1921 Comm: modprobe Not tainted 3.14.19-02124-g95c5b7b torvalds#308
[   51.487744] [<c0016090>] (unwind_backtrace) from [<c0012060>] (show_stack+0x10/0x14)
[   51.495865] [<c0012060>] (show_stack) from [<c0652a20>] (dump_stack+0x78/0x94)
[   51.503444] [<c0652a20>] (dump_stack) from [<c0086f18>] (check_usage+0x408/0x594)
[   51.511293] [<c0086f18>] (check_usage) from [<c00870f8>] (check_irq_usage+0x54/0xb0)
[   51.519416] [<c00870f8>] (check_irq_usage) from [<c0088724>] (__lock_acquire+0xe54/0x1b90)
[   51.528077] [<c0088724>] (__lock_acquire) from [<c00899ac>] (lock_acquire+0x9c/0x104)
[   51.536291] [<c00899ac>] (lock_acquire) from [<c06584e8>] (_raw_spin_lock+0x28/0x38)
[   51.544417] [<c06584e8>] (_raw_spin_lock) from [<c01127fc>] (find_vmap_area+0x10/0x6c)
[   51.552726] [<c01127fc>] (find_vmap_area) from [<c0114960>] (remove_vm_area+0x8/0x6c)
[   51.560935] [<c0114960>] (remove_vm_area) from [<c0114a7c>] (__vunmap+0x20/0xf8)
[   51.568693] [<c0114a7c>] (__vunmap) from [<c001c350>] (__arm_iounmap+0x10/0x18)
[   51.576362] [<c001c350>] (__arm_iounmap) from [<bf017d08>] (cpdma_ctlr_destroy+0xf0/0x114 [davinci_cpdma])
[   51.586494] [<bf017d08>] (cpdma_ctlr_destroy [davinci_cpdma]) from [<bf026294>] (cpsw_remove+0x48/0x8c [ti_cpsw])
[   51.597261] [<bf026294>] (cpsw_remove [ti_cpsw]) from [<c03e62c8>] (platform_drv_remove+0x18/0x1c)
[   51.606659] [<c03e62c8>] (platform_drv_remove) from [<c03e4c44>] (__device_release_driver+0x70/0xc8)
[   51.616237] [<c03e4c44>] (__device_release_driver) from [<c03e5458>] (driver_detach+0xb4/0xb8)
[   51.625264] [<c03e5458>] (driver_detach) from [<c03e4a6c>] (bus_remove_driver+0x4c/0x90)
[   51.633749] [<c03e4a6c>] (bus_remove_driver) from [<c00b024c>] (SyS_delete_module+0x10c/0x198)
[   51.642781] [<c00b024c>] (SyS_delete_module) from [<c000e580>] (ret_fast_syscall+0x0/0x48)

Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wutiejun pushed a commit that referenced this pull request Oct 25, 2014
A panic was seen in the following sitation.

There are two threads running on the system. The first thread is a system
monitoring thread that is reading /proc/modules. The second thread is
loading and unloading a module (in this example I'm using my simple
dummy-module.ko).  Note, in the "real world" this occurred with the qlogic
driver module.

When doing this, the following panic occurred:

 ------------[ cut here ]------------
 kernel BUG at kernel/module.c:3739!
 invalid opcode: 0000 [#1] SMP
 Modules linked in: binfmt_misc sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw igb gf128mul glue_helper iTCO_wdt iTCO_vendor_support ablk_helper ptp sb_edac cryptd pps_core edac_core shpchp i2c_i801 pcspkr wmi lpc_ich ioatdma mfd_core dca ipmi_si nfsd ipmi_msghandler auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm isci drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_module]
 CPU: 37 PID: 186343 Comm: cat Tainted: GF          O--------------   3.10.0+ torvalds#7
 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
 task: ffff8807fd2d8000 ti: ffff88080fa7c000 task.ti: ffff88080fa7c000
 RIP: 0010:[<ffffffff810d64c5>]  [<ffffffff810d64c5>] module_flags+0xb5/0xc0
 RSP: 0018:ffff88080fa7fe18  EFLAGS: 00010246
 RAX: 0000000000000003 RBX: ffffffffa03b5200 RCX: 0000000000000000
 RDX: 0000000000001000 RSI: ffff88080fa7fe38 RDI: ffffffffa03b5000
 RBP: ffff88080fa7fe28 R08: 0000000000000010 R09: 0000000000000000
 R10: 0000000000000000 R11: 000000000000000f R12: ffffffffa03b5000
 R13: ffffffffa03b5008 R14: ffffffffa03b5200 R15: ffffffffa03b5000
 FS:  00007f6ae57ef740(0000) GS:ffff88101e7a0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000404f70 CR3: 0000000ffed48000 CR4: 00000000001407e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Stack:
  ffffffffa03b5200 ffff8810101e4800 ffff88080fa7fe70 ffffffff810d666c
  ffff88081e807300 000000002e0f2fbf 0000000000000000 ffff88100f257b00
  ffffffffa03b5008 ffff88080fa7ff48 ffff8810101e4800 ffff88080fa7fee0
 Call Trace:
  [<ffffffff810d666c>] m_show+0x19c/0x1e0
  [<ffffffff811e4d7e>] seq_read+0x16e/0x3b0
  [<ffffffff812281ed>] proc_reg_read+0x3d/0x80
  [<ffffffff811c0f2c>] vfs_read+0x9c/0x170
  [<ffffffff811c1a58>] SyS_read+0x58/0xb0
  [<ffffffff81605829>] system_call_fastpath+0x16/0x1b
 Code: 48 63 c2 83 c2 01 c6 04 03 29 48 63 d2 eb d9 0f 1f 80 00 00 00 00 48 63 d2 c6 04 13 2d 41 8b 0c 24 8d 50 02 83 f9 01 75 b2 eb cb <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
 RIP  [<ffffffff810d64c5>] module_flags+0xb5/0xc0
  RSP <ffff88080fa7fe18>

    Consider the two processes running on the system.

    CPU 0 (/proc/modules reader)
    CPU 1 (loading/unloading module)

    CPU 0 opens /proc/modules, and starts displaying data for each module by
    traversing the modules list via fs/seq_file.c:seq_open() and
    fs/seq_file.c:seq_read().  For each module in the modules list, seq_read
    does

            op->start()  <-- this is a pointer to m_start()
            op->show()   <- this is a pointer to m_show()
            op->stop()   <-- this is a pointer to m_stop()

    The m_start(), m_show(), and m_stop() module functions are defined in
    kernel/module.c. The m_start() and m_stop() functions acquire and release
    the module_mutex respectively.

    ie) When reading /proc/modules, the module_mutex is acquired and released
    for each module.

    m_show() is called with the module_mutex held.  It accesses the module
    struct data and attempts to write out module data.  It is in this code
    path that the above BUG_ON() warning is encountered, specifically m_show()
    calls

    static char *module_flags(struct module *mod, char *buf)
    {
            int bx = 0;

            BUG_ON(mod->state == MODULE_STATE_UNFORMED);
    ...

    The other thread, CPU 1, in unloading the module calls the syscall
    delete_module() defined in kernel/module.c.  The module_mutex is acquired
    for a short time, and then released.  free_module() is called without the
    module_mutex.  free_module() then sets mod->state = MODULE_STATE_UNFORMED,
    also without the module_mutex.  Some additional code is called and then the
    module_mutex is reacquired to remove the module from the modules list:

        /* Now we can delete it from the lists */
        mutex_lock(&module_mutex);
        stop_machine(__unlink_module, mod, NULL);
        mutex_unlock(&module_mutex);

This is the sequence of events that leads to the panic.

CPU 1 is removing dummy_module via delete_module().  It acquires the
module_mutex, and then releases it.  CPU 1 has NOT set dummy_module->state to
MODULE_STATE_UNFORMED yet.

CPU 0, which is reading the /proc/modules, acquires the module_mutex and
acquires a pointer to the dummy_module which is still in the modules list.
CPU 0 calls m_show for dummy_module.  The check in m_show() for
MODULE_STATE_UNFORMED passed for dummy_module even though it is being
torn down.

Meanwhile CPU 1, which has been continuing to remove dummy_module without
holding the module_mutex, now calls free_module() and sets
dummy_module->state to MODULE_STATE_UNFORMED.

CPU 0 now calls module_flags() with dummy_module and ...

static char *module_flags(struct module *mod, char *buf)
{
        int bx = 0;

        BUG_ON(mod->state == MODULE_STATE_UNFORMED);

and BOOM.

Acquire and release the module_mutex lock around the setting of
MODULE_STATE_UNFORMED in the teardown path, which should resolve the
problem.

Testing: In the unpatched kernel I can panic the system within 1 minute by
doing

while (true) do insmod dummy_module.ko; rmmod dummy_module.ko; done

and

while (true) do cat /proc/modules; done

in separate terminals.

In the patched kernel I was able to run just over one hour without seeing
any issues.  I also verified the output of panic via sysrq-c and the output
of /proc/modules looks correct for all three states for the dummy_module.

        dummy_module 12661 0 - Unloading 0xffffffffa03a5000 (OE-)
        dummy_module 12661 0 - Live 0xffffffffa03bb000 (OE)
        dummy_module 14015 1 - Loading 0xffffffffa03a5000 (OE+)

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
wutiejun pushed a commit that referenced this pull request Oct 25, 2014
If dma_async_device_register() returns error and probe should clean up
and return error, a NULL pointer exception happens because of
dereference of not allocated channel thread:

Dmesg log (from early printk):
dma-pl330 12680000.pdma: unable to register DMAC
DMA pl330_control: removing pch: eeac4000, chan: eeac4014, thread:   (null)
Unable to handle kernel NULL pointer dereference at virtual address 0000000c
pgd = c0004000
[0000000c] *pgd=00000000
Internal error: Oops: 5 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 3.17.0-rc3-next-20140904-00005-g6cc4c1937d90-dirty torvalds#427
task: ee80a800 ti: ee888000 task.ti: ee888000
PC is at _stop+0x8/0x2c8
LR is at pl330_control+0x70/0x2e8
pc : [<c0205dc8>]    lr : [<c020623c>]    psr: 60000193
sp : ee889df8  ip : 00000002  fp : 00000000
r10: eeac4014  r9 : ee0e62bc  r8 : 00000000
r7 : eeac405c  r6 : 60000113  r5 : ee0e6210  r4 : eeac4000
r3 : 00000002  r2 : 00000002  r1 : 00010000  r0 : 00000000
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
Control: 10c5387d  Table: 4000404a  DAC: 00000015
Process swapper/0 (pid: 1, stack limit = 0xee888240)
Stack: (0xee889df8 to 0xee88a000)
9de0:                                                       00000002 eeac4000
9e00: ee0e6210 eeac4000 ee0e6210 60000113 eeac405c c020623c 00000000 c020725c
9e20: ee889e20 ee889e20 ee0e6210 eeac4080 00200200 00100100 eeac4014 00000020
9e40: ee0e6218 c0208374 00000000 ee9bb340 ee0e6210 00000000 00000000 c0605cd8
9e60: ee970000 c0605c84 ee9700f8 00000000 c05c4270 00000000 00000000 c0203b3c
9e80: ee970000 c06624a8 00000000 c0605c84 00000000 c023f890 ee970000 c0605c84
9ea0: ee970034 00000000 c05b23d0 c023fa3c 00000000 c0605c84 c023f9b0 c023e0d4
9ec0: ee947e78 ee9b9440 c0605c84 eea1e780 c0605acc c023f094 c0513b50 c0605c84
9ee0: c05ecbd8 c0605c84 c05ecbd8 ee11ba40 c0626500 c0240064 00000000 c05ecbd8
9f00: c05ecbd8 c0008964 c040f13c 0000009f c0626500 c057465c ee80a800 60000113
9f20: 00000000 c05efdb0 60000113 00000000 ef7fc89d c0421168 0000008f c003787c
9f40: c0573d6c 00000006 ef7fc8bb 00000006 c05efd50 ef7fc800 c05dfbc4 00000006
9f60: c05c4264 c0626500 0000008f c05c4270 c059b518 c059bcb4 00000006 00000006
9f80: c059b518 c003c08c 00000000 c040091c 00000000 00000000 00000000 00000000
9fa0: 00000000 c0400924 00000000 c000e7b8 00000000 00000000 00000000 00000000
9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 c0c0c0c0 c0c0c0c0
[<c0205dc8>] (_stop) from [<c020623c>] (pl330_control+0x70/0x2e8)
[<c020623c>] (pl330_control) from [<c0208374>] (pl330_probe+0x594/0x75c)
[<c0208374>] (pl330_probe) from [<c0203b3c>] (amba_probe+0xb8/0x120)
[<c0203b3c>] (amba_probe) from [<c023f890>] (driver_probe_device+0x10c/0x22c)
[<c023f890>] (driver_probe_device) from [<c023fa3c>] (__driver_attach+0x8c/0x90)
[<c023fa3c>] (__driver_attach) from [<c023e0d4>] (bus_for_each_dev+0x54/0x88)
[<c023e0d4>] (bus_for_each_dev) from [<c023f094>] (bus_add_driver+0xd4/0x1d0)
[<c023f094>] (bus_add_driver) from [<c0240064>] (driver_register+0x78/0xf4)
[<c0240064>] (driver_register) from [<c0008964>] (do_one_initcall+0x80/0x1d0)
[<c0008964>] (do_one_initcall) from [<c059bcb4>] (kernel_init_freeable+0x108/0x1d4)
[<c059bcb4>] (kernel_init_freeable) from [<c0400924>] (kernel_init+0x8/0xec)
[<c0400924>] (kernel_init) from [<c000e7b8>] (ret_from_fork+0x14/0x3c)
Code: e5813010 e12fff1e e92d40f0 e24dd00c (e590200c)
---[ end trace c94b2f4f38dff3bf ]---

This happens because the necessary resources were not yet allocated - no
call to pl330_alloc_chan_resources().

Terminate the thread and free channel resource only if channel thread is not NULL.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: <stable@vger.kernel.org>
Fixes: 0b94c57 ("DMA: PL330: Add check if device tree compatible")
Reviewed-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
wutiejun pushed a commit that referenced this pull request Oct 25, 2014
Fix a NULL pointer dereference after unbinding the driver, if channel
resources were not yet allocated (no call to
pl330_alloc_chan_resources()):
$ echo 12850000.mdma > /sys/bus/amba/drivers/dma-pl330/unbind
[   13.606533] DMA pl330_control: removing pch: eeab6800, chan: eeab6814, thread:   (null)
[   13.614472] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
[   13.622537] pgd = ee284000
[   13.625228] [0000000c] *pgd=6e1e4831, *pte=00000000, *ppte=00000000
[   13.631482] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[   13.636859] Modules linked in:
[   13.639903] CPU: 0 PID: 1 Comm: sh Not tainted 3.17.0-rc3-next-20140904-00004-g7020ffc33ca3-dirty torvalds#420
[   13.649187] task: ee80a800 ti: ee888000 task.ti: ee888000
[   13.654589] PC is at _stop+0x8/0x2c8
[   13.658131] LR is at pl330_control+0x70/0x2e8
[   13.662468] pc : [<c0206028>]    lr : [<c020649c>]    psr: 60000093
[   13.662468] sp : ee889e58  ip : 00000001  fp : 000bab70
[   13.673922] r10: eeab6814  r9 : ee16debc  r8 : 00000000
[   13.679131] r7 : eeab685c  r6 : 60000013  r5 : ee16de10  r4 : eeab6800
[   13.685641] r3 : 00000002  r2 : 00000000  r1 : 00010000  r0 : 00000000
[   13.692153] Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
[   13.699357] Control: 10c5387d  Table: 6e28404a  DAC: 00000015
[   13.705085] Process sh (pid: 1, stack limit = 0xee888240)
[   13.710466] Stack: (0xee889e58 to 0xee88a000)
[   13.714808] 9e40:                                                       00000002 eeab6800
[   13.722969] 9e60: ee16de10 eeab6800 ee16de10 60000013 eeab685c c020649c 00000000 c040280c
[   13.731128] 9e80: ee889e80 ee889e80 ee16de18 ee16de10 eeab6880 eeab6814 00200200 eeab68a8
[   13.739287] 9ea0: 00100100 c0208048 00000000 c0409fc4 eea80800 eea808f8 c0605c44 0000000e
[   13.747446] 9ec0: 0000000e eeb3960c eeb39600 c0203c48 eea80800 c0605c44 c0605a8c c023f694
[   13.755605] 9ee0: ee80a800 eea80834 eea80800 c023f704 ee80a800 eea80800 c0605c44 c023e8ec
[   13.763764] 9f00: 0000000e ee149780 ee29e580 ee889f80 ee29e580 c023e19c 0000000e c01167e4
[   13.771923] 9f20: c01167a0 00000000 00000000 c0115e88 00000000 00000000 ee0b1a00 0000000e
[   13.780082] 9f40: b6f48000 ee889f80 0000000e ee888000 b6f48000 c00bfadc 00000000 00000003
[   13.788241] 9f60: 00000000 00000000 00000000 ee0b1a00 ee0b1a00 0000000e b6f48000 c00bfdf4
[   13.796401] 9f80: 00000000 00000000 ffffffff 0000000e b6f48000 b6edc5d0 00000004 c000e7a4
[   13.804560] 9fa0: 00000000 c000e620 0000000e b6f48000 00000001 b6f48000 0000000e 00000000
[   13.812719] 9fc0: 0000000e b6f48000 b6edc5d0 00000004 0000000e b6f4c8c0 000c3470 000bab70
[   13.820879] 9fe0: 00000000 bed2aa50 b6e18bdc b6e6b52c 60000010 00000001 c0c0c0c0 c0c0c0c0
[   13.829058] [<c0206028>] (_stop) from [<c020649c>] (pl330_control+0x70/0x2e8)
[   13.836165] [<c020649c>] (pl330_control) from [<c0208048>] (pl330_remove+0xb0/0xdc)
[   13.843800] [<c0208048>] (pl330_remove) from [<c0203c48>] (amba_remove+0x24/0xc0)
[   13.851272] [<c0203c48>] (amba_remove) from [<c023f694>] (__device_release_driver+0x70/0xc4)
[   13.859685] [<c023f694>] (__device_release_driver) from [<c023f704>] (device_release_driver+0x1c/0x28)
[   13.868971] [<c023f704>] (device_release_driver) from [<c023e8ec>] (unbind_store+0x58/0x90)
[   13.877303] [<c023e8ec>] (unbind_store) from [<c023e19c>] (drv_attr_store+0x20/0x2c)
[   13.885036] [<c023e19c>] (drv_attr_store) from [<c01167e4>] (sysfs_kf_write+0x44/0x48)
[   13.892928] [<c01167e4>] (sysfs_kf_write) from [<c0115e88>] (kernfs_fop_write+0xc0/0x17c)
[   13.901090] [<c0115e88>] (kernfs_fop_write) from [<c00bfadc>] (vfs_write+0xa0/0x1a8)
[   13.908812] [<c00bfadc>] (vfs_write) from [<c00bfdf4>] (SyS_write+0x40/0x8c)
[   13.915850] [<c00bfdf4>] (SyS_write) from [<c000e620>] (ret_fast_syscall+0x0/0x30)
[   13.923392] Code: e5813010 e12fff1e e92d40f0 e24dd00c (e590200c)
[   13.929467] ---[ end trace 10064e15a5929cf8 ]---

Terminate the thread and free channel resource only if channel resources
were allocated (thread is not NULL).

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: <stable@vger.kernel.org>
Fixes: b3040e4 ("DMA: PL330: Add dma api driver")
Reviewed-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.