Ahmad Fatoum [Mon, 17 Feb 2025 20:39:49 +0000 (21:39 +0100)]
platform/chrome: cros_ec_lpc: prepare for hw_protection_shutdown removal
In the general case, a driver doesn't know which of system shutdown or
reboot is the better action to take to protect hardware in an emergency
situation. For this reason, hw_protection_shutdown is going to be removed
in favor of hw_protection_trigger, which defaults to shutdown, but may be
configured at kernel runtime to be a reboot instead.
The ChromeOS EC situation is different: here we do know that shutdown is
the correct action, as the EC is programmed to force a reset after a short
period anyway. Therefore replace hw_protection_shutdown with
__hw_protection_trigger, passing HWPROT_ACT_SHUTDOWN as the argument to
maintain the same behavior.
No functional change.
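A sketch of the substitution (the reason string and FORCED_SHUTDOWN_MS are
placeholders, and the __hw_protection_trigger() signature is assumed from
this series, not taken from the driver):

  /* before: unconditionally power off, forcing it after a grace period */
  hw_protection_shutdown("CrOS EC critical event", FORCED_SHUTDOWN_MS);

  /* after: same behaviour, expressed through the new API */
  __hw_protection_trigger("CrOS EC critical event", FORCED_SHUTDOWN_MS,
                          HWPROT_ACT_SHUTDOWN);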
Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-9-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Acked-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ahmad Fatoum [Mon, 17 Feb 2025 20:39:48 +0000 (21:39 +0100)]
regulator: allow user configuration of hardware protection action
When the core detects permanent regulator hardware failure or imminent
power failure of a critical supply, it will call hw_protection_shutdown in
an attempt to do a limited orderly shutdown followed by powering off the
system.
This doesn't work out well for many unattended embedded systems that don't
have support for shutdown and that power on automatically when power is
supplied:
- A brief power cycle gets detected by the driver
- The kernel powers down the system and SoC goes into shutdown mode
- Power is restored
- The system remains oblivious to the restored power
- System needs to be manually power cycled for a duration long enough
to drain the capacitors
Allow users to fix this by calling the newly introduced
hw_protection_trigger() instead: this way, the hw_protection command-line
or sysfs parameter dictates the policy for dealing with the regulator
fault.
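A sketch of the intended call in the regulator core (the reason string and
REGULATOR_FORCED_SHUTDOWN_MS are placeholders; hw_protection_trigger() is
assumed to take the same reason/timeout arguments as
hw_protection_shutdown()):

  /* before: policy hardcoded to a shutdown */
  hw_protection_shutdown("Regulator failure", REGULATOR_FORCED_SHUTDOWN_MS);

  /* after: shutdown or reboot, per the hw_protection= setting */
  hw_protection_trigger("Regulator failure", REGULATOR_FORCED_SHUTDOWN_MS);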
Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-8-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Reviewed-by: Matti Vaittinen <mazziesaccount@gmail.com> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ahmad Fatoum [Mon, 17 Feb 2025 20:39:47 +0000 (21:39 +0100)]
reboot: add support for configuring emergency hardware protection action
We currently leave the decision of whether to shutdown or reboot to
protect hardware in an emergency situation to the individual drivers.
This works out in some cases, where the driver detecting the critical
failure has inside knowledge: It binds to the system management controller
for example or is guided by hardware description that defines what to do.
In the general case, however, the driver detecting the issue can't know
what the appropriate course of action is and shouldn't be dictating the
policy of dealing with it.
Therefore, add a global hw_protection toggle that allows the user to
specify whether shutdown or reboot should be the default action when the
driver doesn't set policy.
This introduces no functional change yet as hw_protection_trigger() has no
callers, but these will be added in subsequent commits.
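A sketch of the shape this takes (names approximate, not the exact
kernel/reboot.c code):

  /* default emergency action; flipped to HWPROT_ACT_REBOOT via the
   * hw_protection= kernel parameter or the matching sysfs attribute */
  static enum hw_protection_action hw_protection_action = HWPROT_ACT_SHUTDOWN;

  void hw_protection_trigger(const char *reason, int ms_until_forced)
  {
          __hw_protection_trigger(reason, ms_until_forced,
                                  hw_protection_action);
  }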
Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-7-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ahmad Fatoum [Mon, 17 Feb 2025 20:39:46 +0000 (21:39 +0100)]
reboot: indicate whether it is a HARDWARE PROTECTION reboot or shutdown
It currently depends on the caller whether we attempt a hardware
protection shutdown (poweroff) or a reboot. A follow-up commit will make
this partially user-configurable, so it's a good idea to have the
emergency message clearly state whether the kernel is going for a reboot
or a shutdown.
Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-6-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ahmad Fatoum [Mon, 17 Feb 2025 20:39:45 +0000 (21:39 +0100)]
reboot: rename now misleading __hw_protection_shutdown symbols
The __hw_protection_shutdown function name has become misleading since it
can cause either a shutdown (poweroff) or a reboot depending on its
argument.
To avoid further confusion, let's rename it, so it doesn't suggest that a
poweroff is all it can do.
Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-5-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ahmad Fatoum [Mon, 17 Feb 2025 20:39:43 +0000 (21:39 +0100)]
docs: thermal: sync hardware protection doc with code
Originally, the thermal framework's only hardware protection action was to
trigger a shutdown. This was changed a little over a year ago to also
support rebooting as an alternative hardware protection action.
Update the documentation to reflect this.
Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-3-e1c09b090c0c@pengutronix.de Fixes: 62e79e38b257 ("thermal/thermal_of: Allow rebooting after critical temp") Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Reviewed-by: Matti Vaittinen <mazziesaccount@gmail.com> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ahmad Fatoum [Mon, 17 Feb 2025 20:39:42 +0000 (21:39 +0100)]
reboot: reboot, not shutdown, on hw_protection_reboot timeout
hw_protection_shutdown() will kick off an orderly shutdown and if that
takes longer than a configurable amount of time, an emergency shutdown
will occur.
Recently, hw_protection_reboot() was added for those systems that don't
implement a proper shutdown and are better served by rebooting and having
the boot firmware worry about doing something about the critical
condition.
On timeout of the orderly reboot of hw_protection_reboot(), the system
would go into shutdown, instead of reboot. This is not a good idea, as
going into shutdown was explicitly not asked for.
Fix this by always doing an emergency reboot if hw_protection_reboot() is
called and the orderly reboot takes too long.
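A sketch of the fixed timeout handler (helper and variable names are
approximate; kernel_power_off() and emergency_restart() are the existing
kernel primitives):

  /* remember what was requested when the emergency work was armed */
  static enum hw_protection_action hw_failure_emergency_action;

  static void hw_failure_emergency_action_func(struct work_struct *work)
  {
          /* the orderly action timed out; force the action that was
           * actually requested instead of always powering off */
          if (hw_failure_emergency_action == HWPROT_ACT_REBOOT)
                  emergency_restart();
          else
                  kernel_power_off();
  }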
Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-2-e1c09b090c0c@pengutronix.de Fixes: 79fa723ba84c ("reboot: Introduce thermal_zone_device_critical_reboot()") Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Reviewed-by: Matti Vaittinen <mazziesaccount@gmail.com> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ahmad Fatoum [Mon, 17 Feb 2025 20:39:41 +0000 (21:39 +0100)]
reboot: replace __hw_protection_shutdown bool action parameter with an enum
Patch series "reboot: support runtime configuration of emergency
hw_protection action", v3.
We currently leave the decision of whether to shutdown or reboot to
protect hardware in an emergency situation to the individual drivers.
This works out in some cases, where the driver detecting the critical
failure has inside knowledge: It binds to the system management controller
for example or is guided by hardware description that defines what to do.
This is inadequate in the general case though as a driver reporting e.g.
an imminent power failure can't know whether a shutdown or a reboot would
be more appropriate for a given hardware platform.
To address this, this series adds a hw_protection kernel parameter and
sysfs toggle that can be used to change the action from the shutdown
default to reboot. A new hw_protection_trigger API then makes use of this
default action.
My particular use case is unattended embedded systems that don't have
support for shutdown and that power on automatically when power is
supplied:
- A brief power cycle gets detected by the driver
- The kernel powers down the system and SoC goes into shutdown mode
- Power is restored
- The system remains oblivious to the restored power
- System needs to be manually power cycled for a duration long enough
to drain the capacitors
With this series, such systems can configure the kernel with
hw_protection=reboot to have the boot firmware worry about critical
conditions.
This patch (of 12):
Currently __hw_protection_shutdown() either reboots or shuts down the
system according to its shutdown argument.
To make the logic easier to follow, both inside __hw_protection_shutdown
and at caller sites, let's replace the bool parameter with an enum.
This will be extra useful when, in a later commit, a third action is added
to the enumeration.
No functional change.
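A sketch of the resulting interface (the argument values in the example
call are illustrative):

  enum hw_protection_action { HWPROT_ACT_SHUTDOWN, HWPROT_ACT_REBOOT };

  void __hw_protection_shutdown(const char *reason, int ms_until_forced,
                                enum hw_protection_action action);

  /* callers become self-documenting compared to a bare true/false: */
  __hw_protection_shutdown("Temperature too high", 100, HWPROT_ACT_REBOOT);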
Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-0-e1c09b090c0c@pengutronix.de Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-1-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Buffer heads are attached to folios, not to pages. Also
flush_dcache_page() is now deprecated in favour of flush_dcache_folio().
Link: https://lkml.kernel.org/r/20250213214533.2242224-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: use memcpy_to_folio() in ocfs2_symlink_get_block()
Replace use of kmap_atomic() with the higher-level construct
memcpy_to_folio(). This removes a use of b_page and supports large folios
as well as being easier to understand. It also removes the check for
kmap_atomic() failing (because it can't).
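A sketch of the change (variable names are illustrative, not the actual
ocfs2 ones):

  /* before, roughly: map the buffer_head's page and copy by hand */
  kaddr = kmap_atomic(bh->b_page);
  memcpy(kaddr, symname, len);
  flush_dcache_page(bh->b_page);
  kunmap_atomic(kaddr);

  /* after: one helper handles mapping, copy and cache maintenance,
   * and works with large folios */
  memcpy_to_folio(bh->b_folio, offset_in_folio(bh->b_folio, bh->b_data),
                  symname, len);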
Link: https://lkml.kernel.org/r/20250213214533.2242224-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Tinguely <mark.tinguely@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alban Kurti [Fri, 7 Feb 2025 18:39:06 +0000 (18:39 +0000)]
checkpatch: add warning for pr_* and dev_* macros without a trailing newline
Add a new check in scripts/checkpatch.pl to detect usage of pr_(level) and
dev_(level) macros (for both C and Rust) when the string literal does not
end with '\n'. Missing trailing newlines can lead to incomplete log lines
that do not appear properly in dmesg or in console output. To show an
example of this working after applying the patch we can run the script on
the commit that likely motivated this need/issue:
Also, the patch correctly handles, for both Rust and C alike, a printing
call without a newline that is later completed by a newline printed via
pr_cont. If no newline is printed and the patch ends, or another pr_*
call appears before a pr_cont prints the newline, a warning is shown.
This is not implemented for dev_cont because it is not clear to me
whether that is used at all.
One false warning that this change can generate: if a patch modifies a
pr_* call without a newline that is followed by a pr_cont with a newline,
but the patch hunk does not include that following pr_cont, the script
will warn that nothing creates a newline. I have made the warning softer
because of this known limitation.
I have tested with comments, whitespace, and different orders of pr_*
calls and pr_cont, and the only case I suspect to be a problem is the one
outlined above.
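Illustrative C examples of what the new check warns about and what it
accepts (messages and variables are made up):

  pr_err("probe failed: %d", ret);     /* warns: no trailing newline   */
  pr_err("probe failed: %d\n", ret);   /* ok                           */

  pr_info("stage one ");               /* ok: the newline is completed */
  pr_cont("done\n");                   /* ... by a following pr_cont   */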
Link: https://lkml.kernel.org/r/20250207-checkpatch-newline2-v4-1-26d8e80d0059@invicto.ai Signed-off-by: Alban Kurti <kurti@invicto.ai> Suggested-by: Miguel Ojeda <ojeda@kernel.org> Closes: https://github.com/Rust-for-Linux/linux/issues/1140 Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Andy Whitcroft <apw@canonical.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Joe Perches <joe@perches.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: Charalampos Mitrodimas <charmitro@posteo.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use rcuref_t for reference counting. This eliminates the cmpxchg loop in
the get and put path. This also eliminates the need to acquire the lock
in the put path: once the final user returns the reference, it can no
longer be obtained.
Use rcuref_t for reference counting.
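A sketch of how the hot paths look with rcuref_t (the field names are
assumed, not the exact ucount.c ones):

  #include <linux/rcuref.h>

  /* get: succeeds as long as the count has not dropped to zero */
  if (!rcuref_get(&ucounts->count))
          return NULL;            /* object is already on its way out */

  /* put: true only for the final reference; no lock needed because the
   * reference can no longer be re-obtained afterwards */
  if (rcuref_put(&ucounts->count))
          kfree_rcu(ucounts, rcu);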
Link: https://lkml.kernel.org/r/20250203150525.456525-5-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The ucounts element is looked up under ucounts_lock. This can be
optimized by using RCU for a lockless lookup, returning the element if a
reference can be obtained.
Replace hlist_head with hlist_nulls_head which is RCU compatible. Let
find_ucounts() search for the required item within an RCU section and
return the item if a reference could be obtained. This means
alloc_ucounts() will always return an element (unless the memory
allocation failed). Let put_ucounts() RCU free the element if the
reference counter dropped to zero.
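A sketch of the lockless lookup (struct member and variable names are
approximate):

  static struct ucounts *find_ucounts(struct user_namespace *ns, kuid_t uid,
                                      struct hlist_nulls_head *hashent)
  {
          struct ucounts *ucounts;
          struct hlist_nulls_node *pos;

          rcu_read_lock();
          hlist_nulls_for_each_entry_rcu(ucounts, pos, hashent, node) {
                  if (uid_eq(ucounts->uid, uid) && ucounts->ns == ns &&
                      rcuref_get(&ucounts->count)) {
                          rcu_read_unlock();
                          return ucounts;
                  }
          }
          rcu_read_unlock();
          return NULL;
  }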
Link: https://lkml.kernel.org/r/20250203150525.456525-4-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ucount: replace get_ucounts_or_wrap() with atomic_inc_not_zero()
get_ucounts_or_wrap() increments the counter and if the counter is
negative then it decrements it again in order to reset the previous
increment. This statement can be replaced with atomic_inc_not_zero() to
only increment the counter if it is not yet 0.
This simplifies the get function because the put (if the get failed) can
be removed. atomic_inc_not_zero() is implemented as a cmpxchg() loop which
can be repeated several times if another get/put is performed in parallel.
This will be optimized later.
Increment the reference counter only if not yet dropped to zero.
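A sketch of the simplified get path (the struct field name is assumed):

  /* only take a reference if the count is still non-zero; no
   * compensating put is needed when this fails */
  struct ucounts *get_ucounts(struct ucounts *ucounts)
  {
          if (ucounts && atomic_inc_not_zero(&ucounts->count))
                  return ucounts;
          return NULL;
  }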
Link: https://lkml.kernel.org/r/20250203150525.456525-3-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
rcu: provide a static initializer for hlist_nulls_head
Patch series "ucount: Simplify refcounting with rcuref_t".
I noticed that the atomic_dec_and_lock_irqsave() in put_ucounts() loops
sometimes even during boot. Something like 2-3 iterations but still.
This series replaces the refcounting with rcuref_t and adds a RCU lookup.
This allows a lockless lookup in alloc_ucounts() if the entry is available
and a cmpxchg()-less put of the item.
This patch (of 4):
Provide a static initializer for hlist_nulls_head so that it can be used
in statically defined data structures.
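A plausible shape for such an initializer, building on the existing
NULLS_MARKER() helper from <linux/list_nulls.h> (the macro name and the
example table are assumptions for illustration):

  #define HLIST_NULLS_HEAD_INIT(nulls) \
          { .first = (struct hlist_nulls_node *)NULLS_MARKER(nulls) }

  /* usable in static tables, e.g. a statically defined hash: */
  static struct hlist_nulls_head example_hashtable[256] = {
          [0 ... 255] = HLIST_NULLS_HEAD_INIT(0),
  };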
Link: https://lkml.kernel.org/r/20250203150525.456525-1-bigeasy@linutronix.de Link: https://lkml.kernel.org/r/20250203150525.456525-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yury Norov [Wed, 5 Feb 2025 21:29:32 +0000 (16:29 -0500)]
lib/zlib: drop EQUAL macro
The macro is prehistoric, and only exists to help those readers who don't
know what memcmp() returns if memory areas differ. This is pretty well
documented, so the macro looks excessive.
Now that the only user of the macro depends on DEBUG_ZLIB config, GCC
warns about unused macro if the library is built with W=2 against
defconfig. So drop it for good.
Vlastimil Babka [Tue, 11 Feb 2025 15:16:11 +0000 (16:16 +0100)]
get_maintainer: add --substatus for reporting subsystem status - fix
The automatically enabled --substatus can break existing scripts that do
not disable --rolestats. Require that script output goes to a terminal to
enable it automatically.
Vlastimil Babka [Mon, 3 Feb 2025 11:13:16 +0000 (12:13 +0100)]
get_maintainer: add --substatus for reporting subsystem status
Patch series "get_maintainer: report subsystem status separately", v2.
The subsystem status (S: field) can inform a patch submitter if the
subsystem is well maintained or e.g. maintainers are missing. In
get_maintainer, it is currently reported with --role(stats) by adjusting
the maintainer role for any status different from Maintained. This has
two downsides:
- if a subsystem has only reviewers or mailing lists and no maintainers,
the status is not reported. For example Orphan subsystems typically
have no maintainers so there's nobody to report as orphan minder.
- the Supported status means that someone is paid for maintaining, but
it is reported as "supporter" for all the maintainers, which can be
incorrect (only some of them may be paid). People (including myself)
have been also confused about what "supporter" means.
The second point has been brought up in 2022 and the discussion in the end
resulted in adjusting documentation only [1]. I however agree with Ted's
points that it's misleading to take the subsystem status and apply it to
all maintainers [2].
The attempt to modify get_maintainer output was retracted after Joe
objected that the status becomes not reported at all [3]. This series
addresses that concern by reporting the status (unless it's the most
common Maintained one) on separate lines that follow the reported emails,
using a new --substatus parameter. Care is taken to keep the noise to a
minimum by not reporting the most common Maintained status, to require no
opt-in by default (which would force users to discover the new parameter),
and at the same time not to break existing git --cc-cmd usage.
The subsystem status is currently reported with --role(stats) by adjusting
the maintainer role for any status different from Maintained. This has
two downsides:
- if a subsystem has only reviewers or mailing lists and no maintainers,
the status is not reported (i.e. typically, Orphan subsystems have no
maintainers)
- the Supported status means that someone is paid for maintaining, but
it is reported as "supporter" for all the maintainers, which can be
incorrect. People have been also confused about what "supporter"
means.
This patch introduces a new --substatus option and functionality aimed to
report the subsystem status separately, without adjusting the reported
maintainer role. After the e-mails are output, the status of subsystems
will follow, for example:
Sourabh Jain [Fri, 31 Jan 2025 11:38:30 +0000 (17:08 +0530)]
powerpc/crash: use generic crashkernel reservation
Commit 0ab97169aa05 ("crash_core: add generic function to do reservation")
added a generic function to reserve crashkernel memory. So let's use the
same function on powerpc and remove the architecture-specific code that
essentially does the same thing.
The generic crashkernel reservation also provides a way to split the
crashkernel reservation into high and low memory reservations, which can
be enabled for powerpc in the future.
Along with moving to the generic crashkernel reservation, the code related
to finding the base address for the crashkernel has been separated into
its own function named get_crash_base() for better readability and
maintainability.
Link: https://lkml.kernel.org/r/20250131113830.925179-8-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sourabh Jain [Fri, 31 Jan 2025 11:38:29 +0000 (17:08 +0530)]
powerpc: insert System RAM resource to prevent crashkernel conflict
The next patch in the series with title "powerpc/crash: use generic
crashkernel reservation" enables powerpc to use generic crashkernel
reservation instead of custom implementation. This leads to exporting of
`Crash Kernel` memory to iomem_resource (/proc/iomem) via
insert_crashkernel_resources():kernel/crash_reserve.c or at another place
in the same file if HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY is set.
The add_system_ram_resources():arch/powerpc/mm/mem.c adds `System RAM` to
iomem_resource using request_resource(). This creates a conflict with
`Crash Kernel`, which is added by the generic crashkernel reservation
code. As a result, the kernel ultimately fails to add `System RAM` to
iomem_resource. Consequently, it does not appear in /proc/iomem.
Multiple approaches were tried to avoid this:
1. Don't add Crash Kernel to iomem_resource:
- This has two issues.
First, it requires adding an architecture-specific hook in the
generic code. There are already two code paths to choose when to
add `Crash Kernel` to iomem_resource. This adds one more code path
to skip it.
Second, what if `Crash Kernel` is required in /proc/iomem in the
future? Many architectures do export it.
2. Don't add `System RAM` to iomem_resource by reverting commit c40dd2f766440 ("powerpc: Add System RAM to /proc/iomem"):
- It's not ideal to export `System RAM` via /proc/iomem, but since
it was already done earlier and userspace tools like kdump and
kdump-utils rely on `System RAM` from /proc/iomem, removing it
will break userspace.
3. Add Crash Kernel along with System RAM to /proc/iomem:
This patch takes the third approach by updating add_system_ram_resources()
to use insert_resource() instead of the request_resource() API to add the
`System RAM` resource to iomem_resource. insert_resource() allows
inserting resources even if they overlap with existing ones. Since `Crash
Kernel` and `System RAM` resources are added to iomem_resource early in
the boot, any other conflict is not expected.
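A sketch of the powerpc change (the surrounding code is simplified; only
the API swap matters):

  /* arch/powerpc/mm/mem.c, add_system_ram_resources(), roughly */
  res->name  = "System RAM";
  res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
  /* insert_resource() tolerates the already-present "Crash Kernel"
   * entry, where request_resource() would fail on the overlap */
  WARN_ON(insert_resource(&iomem_resource, res) < 0);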
With the changes introduced here and in the next patch, "powerpc/crash:
use generic crashkernel reservation," /proc/iomem now exports `System RAM`
and `Crash Kernel` as shown below:
Commit 59d58189f3d9 ("crash: fix crash memory reserve exceed system memory
bug") fails crashkernel parsing if the crash size is found to be higher
than system RAM, which makes the memory_limit adjustment code ineffective
due to an early exit from reserve_crashkernel().
Regardless, let's not violate the user-specified memory limit by adjusting
it. Remove this adjustment to ensure all reservations stay within the
limit. Commit f94f5ac07983 ("powerpc/fadump: Don't update the
user-specified memory limit") did the same for fadump.
Link: https://lkml.kernel.org/r/20250131113830.925179-6-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sourabh Jain [Fri, 31 Jan 2025 11:38:27 +0000 (17:08 +0530)]
powerpc/crash: use generic APIs to locate memory hole for kdump
On PowerPC, the memory reserved for the crashkernel can contain components
like RTAS, TCE, OPAL, etc., which should be avoided when loading kexec
segments into crashkernel memory. Due to these special components,
PowerPC has its own set of APIs to locate holes in the crashkernel memory
for loading kexec segments for kdump. However, for loading kexec segments
in the kexec case, PowerPC already uses generic APIs to locate holes.
The previous patch in this series, titled "crash: Let arch decide usable
memory range in reserved area," introduced an arch-specific hook to handle
such special regions in the crashkernel area. So, switch PowerPC to use
the generic APIs to locate memory holes for kdump and remove the redundant
PowerPC-specific APIs.
Link: https://lkml.kernel.org/r/20250131113830.925179-5-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sourabh Jain [Fri, 31 Jan 2025 11:38:26 +0000 (17:08 +0530)]
crash: let arch decide usable memory range in reserved area
Although the crashkernel area is reserved, on architectures like PowerPC,
it is possible for the crashkernel reserved area to contain components
like RTAS, TCE, OPAL, etc. To avoid placing kexec segments over these
components, PowerPC has its own set of APIs to locate holes in the
crashkernel reserved area.
Add an arch hook in the generic locate mem hole APIs so that architectures
can handle such special regions in the crashkernel area while locating
memory holes for kexec segments using generic APIs. With this, a lot of
redundant arch-specific code can be removed, as it performs the exact same
job as the generic APIs.
To keep the generic and arch-specific changes separate, the changes
related to moving PowerPC to use the generic APIs and the removal of
PowerPC-specific APIs for memory hole allocation are done in a subsequent
patch titled "powerpc/crash: Use generic APIs to locate memory hole for
kdump".
Link: https://lkml.kernel.org/r/20250131113830.925179-4-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sourabh Jain [Fri, 31 Jan 2025 11:38:24 +0000 (17:08 +0530)]
kexec: initialize ELF lowest address to ULONG_MAX
Patch series "powerpc/crash: use generic crashkernel reservation", v3.
Commit 0ab97169aa05 ("crash_core: add generic function to do reservation")
added a generic function to reserve crashkernel memory. So let's use the
same function on powerpc and remove the architecture-specific code that
essentially does the same thing.
The generic crashkernel reservation also provides a way to split the
crashkernel reservation into high and low memory reservations, which can
be enabled for powerpc in the future.
Additionally move powerpc to use generic APIs to locate memory hole for
kexec segments while loading kdump kernel.
This patch (of 7):
kexec_elf_load() loads an ELF executable and sets the address of the
lowest PT_LOAD section to the address held by the lowest_load_addr
function argument.
To determine the lowest PT_LOAD address, a local variable lowest_addr
(type unsigned long) is initialized to UINT_MAX. After loading each
PT_LOAD, its address is compared to lowest_addr. If a loaded PT_LOAD
address is lower, lowest_addr is updated. However, setting lowest_addr to
UINT_MAX won't work when the kernel image is loaded above 4G, as the
returned lowest PT_LOAD address would be invalid. This is resolved by
initializing lowest_addr to ULONG_MAX instead.
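A minimal sketch of the fix in kexec_elf_load() (loop context elided):

  unsigned long lowest_addr = ULONG_MAX;    /* was UINT_MAX */

  /* after each PT_LOAD segment is placed: */
  if (load_addr < lowest_addr)
          lowest_addr = load_addr;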
This issue was discovered while implementing crashkernel high/low
reservation on the PowerPC architecture.
I Hsin Cheng [Sun, 19 Jan 2025 06:24:08 +0000 (14:24 +0800)]
lib/plist.c: add shortcut for plist_requeue()
In the operation of plist_requeue(), "node" is deleted from the list
before queueing it back to the list again, which involves looping to find
the tail of same-prio entries.
If "node" is the head of the same-prio entries, which means its prio_list
is on the priority list, then "node_next" can be retrieved immediately
from the next entry of prio_list, instead of looping over nodes on
node_list.
The shortcut implementation can benefit plist_requeue() running the below
test, and the test result is shown in the following table.
One can observe from the test result that the smaller the number of nodes
with the same priority, the higher the probability of hitting the
shortcut, and thus the more significant the benefit. For long runs of
same-prio entries it tends to behave almost the same, since the
probability of taking the shortcut is much smaller.
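A sketch of the shortcut idea (not the exact lib/plist.c code):

  /* default: insert at the tail of the whole list */
  struct list_head *node_next = &head->node_list;

  if (!list_empty(&node->prio_list)) {
          /* "node" heads its same-prio group, so the head of the next
           * group follows it on prio_list; requeueing in front of that
           * head lands "node" at the tail of its own group without
           * walking node_list */
          struct plist_node *next_prio = list_next_entry(node, prio_list);

          node_next = &next_prio->node_list;
  } else {
          /* slow path: walk node_list to find the end of this group */
  }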
The test is done on x86_64 architecture with v6.9 kernel and
Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz.
Test script (executed in kernel module mode):
#include <linux/module.h>
#include <linux/plist.h>
#include <linux/ktime.h>

#define TEST_SIZE 1000	/* arbitrary for illustration; varied in the tests */

static struct plist_head test_head_local;
static struct plist_node test_node_local[TEST_SIZE];
static unsigned int test_data[TEST_SIZE];

int init_module(void)
{
	ktime_t start, end;
	s64 time_elapsed = 0;
	int i;

	plist_head_init(&test_head_local);

	/*
	 * Split the list into 10 different priorities; when TEST_SIZE is
	 * larger, the number of nodes within each priority is larger.
	 */
	for (i = 0; i < ARRAY_SIZE(test_data); i++)
		test_data[i] = i % 10;

	for (i = 0; i < ARRAY_SIZE(test_node_local); i++) {
		plist_node_init(test_node_local + i, 0);
		test_node_local[i].prio = test_data[i];
	}

	for (i = 0; i < ARRAY_SIZE(test_node_local); i++) {
		if (plist_node_empty(test_node_local + i))
			plist_add(test_node_local + i, &test_head_local);
	}

	for (i = 0; i < ARRAY_SIZE(test_node_local); i++) {
		start = ktime_get();
		plist_requeue(test_node_local + i, &test_head_local);
		end = ktime_get();
		time_elapsed += (end - start);
	}

	pr_info("plist_requeue: total time %lld ns\n", time_elapsed);
	return 0;
}
Add a paragraph explaining what sort of capabilities a process would need
to read procfs data for some other process. Also mention that reading
data for its own process doesn't require any extra permissions.
scripts: add script to extract built-in firmware blobs
There is currently no tool, to the best of my knowledge, to extract a
firmware blob that is built into vmlinux. So if we have a kernel image
containing the blobs, and we want to rebuild the kernel with some debug
patches for example (and given that the image also has IKCONFIG=y), we
currently can't do that with the same versions of all the firmware blobs,
_unless_ we know the exact linux-firmware commits providing the specific
version of each included firmware.
Through the options CONFIG_EXTRA_FIRMWARE{_DIR} one is able to build a
kernel with firmware blobs built in. This is usually the case for
built-in drivers that require some blobs in order to work properly, for
example on non-initrd based systems.
Hence, add a script to extract these blobs from a non-stripped vmlinux,
similar to the idea of "extract-ikconfig". The firmware loader interface
saves such built-in blobs as rodata entries, having a field for the FW
name as "_fw_<module_name>_<firmware_name>_bin"; the tool extracts files
named "<module_name>_<firmware_name>" for each rodata firmware entry
detected. It makes use of awk, bash, dd and readelf, pretty standard
tooling for Linux development.
With this tool, we can blindly extract the FWs and easily re-add them
in the new debug kernel build, allowing a more deterministic testing
without the burden of "hunting down" the proper version of each
firmware binary.
Link: https://lkml.kernel.org/r/20250120190436.127578-1-gpiccoli@igalia.com Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Reviewed-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Russ Weight <russ.weight@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Yang [Fri, 17 Jan 2025 14:20:13 +0000 (22:20 +0800)]
MAINTAINERS: add Yang Yang as a co-maintainer of PER-TASK DELAY ACCOUNTING
Balbir Singh is the sole maintainer of PER-TASK DELAY ACCOUNTING, and he
started work on cgroupstats a long time back; since then the subsystem has
not been growing at a very rapid pace. Thanks to that excellent work,
delay accounting is still very useful for observing and optimizing system
delay, but it still needs continuous improvement. Yang Yang and his team
have worked on most of the recent patches in the subsystem, and he has a
strong willingness to help. Balbir Singh is glad to see that, so add him
as a co-maintainer.
Andrii Nakryiko [Mon, 27 Jan 2025 22:21:14 +0000 (14:21 -0800)]
mm,procfs: allow read-only remote mm access under CAP_PERFMON
It's very common for various tracing and profiling tools to need access
to /proc/PID/maps contents for stack symbolization, to learn which shared
libraries are mapped in memory, at which file offset, etc. Currently,
access to /proc/PID/maps requires CAP_SYS_PTRACE (unless we are looking
at data for our own process, which is a trivial case not too relevant for
profiler use cases).
Unfortunately, CAP_SYS_PTRACE implies way more than just the ability to
discover the memory layout of another process: it allows full control of
arbitrary other processes. This is problematic from a security POV for
applications that only need read-only /proc/PID/maps (and other similar
read-only data) access, and in large production settings CAP_SYS_PTRACE is
frowned upon even for the system-wide profilers.
On the other hand, it's already possible to access similar kind of
information (and more) with just CAP_PERFMON capability. E.g., setting up
PERF_RECORD_MMAP collection through perf_event_open() would give one
similar information to what /proc/PID/maps provides.
CAP_PERFMON, together with CAP_BPF, is already a very common combination
for system-wide profiling and observability application. As such, it's
reasonable and convenient to be able to access /proc/PID/maps with
CAP_PERFMON capabilities instead of CAP_SYS_PTRACE.
For procfs, these permissions are checked through the common mm_access()
helper, and so we augment that with a cap_perfmon() check *only* if the
requested mode is PTRACE_MODE_READ. I.e., PTRACE_MODE_ATTACH wouldn't be
permitted by CAP_PERFMON. So /proc/PID/mem, which uses
PTRACE_MODE_ATTACH, won't be permitted by CAP_PERFMON, but /proc/PID/maps,
/proc/PID/environ, and a bunch of other read-only contents will be
allowable under CAP_PERFMON.
Besides procfs itself, mm_access() is used by process_madvise() and
process_vm_{readv,writev}() syscalls. The former one uses
PTRACE_MODE_READ to avoid leaking ASLR metadata, and as such CAP_PERFMON
seems like a meaningful allowable capability as well.
process_vm_{readv,writev} currently assume PTRACE_MODE_ATTACH level of
permissions (though for readv PTRACE_MODE_READ seems more reasonable, but
that's outside the scope of this change), and as such won't be affected by
this patch.
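A sketch of where the check lands (simplified; the real ptrace-mode logic
has more cases, and the surrounding function is not shown):

  /* in the common access check used by mm_access(), roughly */
  if ((mode & PTRACE_MODE_READ) && perfmon_capable())
          return 0;       /* read-only introspection, e.g. /proc/PID/maps */

  /* PTRACE_MODE_ATTACH users such as /proc/PID/mem still require
   * CAP_SYS_PTRACE as before */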
As has_unmovable_pages() calls folio_hstate() without hugetlb_lock, there
is a race where the HugeTLB page can be freed between PageHuge() and
folio_hstate(). There is no need to add hugetlb_lock here, as the HugeTLB
page can be freed in a lot of places. So it's enough to unfold
folio_hstate() and add a NULL check to avoid a NULL pointer dereference in
hugepage_migration_supported().
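A sketch of the unfolded check (context simplified):

  if (folio_test_hugetlb(folio)) {
          /* the folio may be freed concurrently, so folio_hstate()
           * cannot be used directly; tolerate a NULL hstate */
          struct hstate *h = size_to_hstate(folio_size(folio));

          if (h && !hugepage_migration_supported(h))
                  return page;
  }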
Link: https://lkml.kernel.org/r/20250122061151.578768-1-liushixin2@huawei.com Fixes: 464c7ffbcb16 ("mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yu Zhao [Wed, 8 Jan 2025 07:48:21 +0000 (00:48 -0700)]
mm/hugetlb_vmemmap: fix memory loads ordering
Using x86_64 as an example, for a 32KB struct page[] area describing a 2MB
hugeTLB, HVO reduces the area to 4KB by the following steps:
1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
by PTE 0, and at the same time change the permission from r/w to
r/o;
3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
to 4KB.
However, the following race can happen due to improper memory load
ordering:
CPU 1 (HVO)                      CPU 2 (speculative PFN walker)
                                 atomic_add_unless(&page->_refcount)
                                 XXX: try to modify r/o struct page[]
Specifically, page_is_fake_head() must be ordered after page_ref_count()
on CPU 2 so that it can only return true for this case, to avoid the later
attempt to modify r/o struct page[].
This patch adds the missing memory barrier and makes the tests on
page_is_fake_head() and page_ref_count() done in the proper order.
Link: https://lkml.kernel.org/r/20250108074822.722696-1-yuzhao@google.com Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Will Deacon <will@kernel.org> Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/ Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Will Deacon <will@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kemeng Shi [Sat, 22 Feb 2025 16:08:46 +0000 (00:08 +0800)]
mm: swap: use correct step in loop to wait all clusters in wait_for_allocation()
Use correct step in loop to wait all clusters in wait_for_allocation().
If we miss some cluster in wait_for_allocation(), use after free may occur
as follows:
shmem_writepage                        swapoff
 folio_alloc_swap
  get_swap_pages
   scan_swap_map_slots
    cluster_alloc_swap_entry
     alloc_swap_scan_cluster
      cluster_alloc_range
       /* SWP_WRITEOK is valid */
       if (!(si->flags & SWP_WRITEOK))
                                       ...
                                       del_from_avail_list(p, true);
                                       ...
                                       /* miss the cluster in shmem_writepage */
                                       wait_for_allocation()
                                       ...
                                       try_to_unuse()
       ...
       add_to_swap_cache
       /* dereference swap_address_space(entry) which is NULL */
       xas_lock_irq(&xas);
Link: https://lkml.kernel.org/r/20250222160850.505274-3-shikemeng@huaweicloud.com Fixes: 9a0ddeb79880 ("mm, swap: hold a reference during scan and cleanup flag usage") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kemeng Shi [Sat, 22 Feb 2025 16:08:45 +0000 (00:08 +0800)]
mm: swap: avoid losing cluster in swap_reclaim_full_clusters()
If no swap cache is reclaimed, the cluster taken off the full_clusters
list will not be put on any list and may not be reused. Do
relocate_cluster() for such a cluster to fix the issue.
Link: https://lkml.kernel.org/r/20250222160850.505274-2-shikemeng@huaweicloud.com Fixes: 3b644773eefda ("mm, swap: reduce contention on device lock") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Sat, 22 Feb 2025 16:19:52 +0000 (16:19 +0000)]
mm: abort vma_modify() on merge out of memory failure
The remainder of vma_modify() relies upon the vmg state remaining pristine
after a merge attempt.
Usually this is the case, however in the one edge case scenario of a merge
attempt failing not due to the specified range being unmergeable, but
rather due to an out of memory error arising when attempting to commit the
merge, this assumption becomes untrue.
This results in vmg->start, end being modified, and thus the proceeding
attempts to split the VMA will be done with invalid start/end values.
Thankfully, it is likely practically impossible for us to hit this in
reality, as it would require a maple tree node pre-allocation failure that
would likely never happen due to it being 'too small to fail', i.e. the
kernel would simply keep retrying reclaim until it succeeded.
However, this scenario remains theoretically possible, and what we are
doing here is wrong so we must correct it.
The safest option is, when this scenario occurs, to simply give up the
operation. If we cannot allocate memory to merge, then we cannot allocate
memory to split either (perhaps moreso!).
Any scenario where this would be happening would be under very extreme
(likely fatal) memory pressure, so it's best we give up early.
So there is no doubt it is appropriate to simply bail out in this
scenario.
However, in general we must if at all possible never assume VMG state is
stable after a merge attempt, since merge operations update VMG fields.
As a result, additionally also make this clear by storing start, end in
local variables.
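A sketch of the resulting flow in vma_modify() (the name of the error
state is assumed from the current vma_merge_struct code):

  unsigned long start = vmg->start;
  unsigned long end = vmg->end;

  merged = vma_merge_existing_range(vmg);
  if (merged)
          return merged;
  /* the merge may have scribbled on vmg; if it failed for lack of
   * memory, give up rather than splitting with a stale range */
  if (vmg->state == VMA_MERGE_ERROR_NOMEM)
          return ERR_PTR(-ENOMEM);

  /* ... subsequent splits use the saved start/end, not vmg->start/end ... */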
The issue was reported originally by syzkaller, and by Brad Spengler (via
an off-list discussion), and in both instances it manifested as a
triggering of the assert:
VM_WARN_ON_VMG(start >= end, vmg);
in vma_merge_existing_range().
It seems at least one scenario in which this is occurring is one in which
the merge being attempted is due to an madvise() across multiple VMAs
which looks like this:
start end
|<------>|
|----------|------|
| vma | next |
|----------|------|
When madvise_walk_vmas() is invoked, we first find vma in the above
(determining prev to be equal to vma as we are offset into vma), and then
enter the loop.
We determine the end of vma that forms part of the range we are
madvise()'ing by setting 'tmp' to this value:
Where the visit() function pointer in this instance is
madvise_vma_behavior().
As observed in syzkaller reports, it is ultimately madvise_update_vma()
that is invoked, calling vma_modify_flags_name() and vma_modify() in turn.
Then, in vma_modify(), we attempt the merge:
merged = vma_merge_existing_range(vmg);
if (merged)
return merged;
We invoke this with vmg->start, end set to start, tmp as such:
start tmp
|<--->|
|----------|------|
| vma | next |
|----------|------|
We find ourselves in the merge right scenario, but the one in which we
cannot remove the middle (we are offset into vma).
Here we have a special case where vmg->start, end get set to perhaps
unintuitive values - we intended to shrink the middle VMA and expand the
next.
This means vmg->start, end are set to... vma->vm_start, start.
Now the commit_merge() fails, and vmg->start, end are left like this.
This means we return to the rest of vma_modify() with vmg->start, end
(here denoted as start', end') set as:
start' end'
|<-->|
|----------|------|
| vma | next |
|----------|------|
So we now erroneously try to split accordingly. This is where the
unfortunate stuff begins.
We start with:
/* Split any preceding portion of the VMA. */
if (vma->vm_start < vmg->start) {
...
}
This doesn't trigger as we are no longer offset into vma at the start.
But then we invoke:
/* Split any trailing portion of the VMA. */
if (vma->vm_end > vmg->end) {
...
}
Which does get invoked. This leaves us with:
start' end'
|<-->|
|----|-----|------|
| vma| new | next |
|----|-----|------|
We then return ultimately to madvise_walk_vmas(). Here 'new' is unknown,
and putting back the values known in this function we are faced with:
start tmp end
| | |
|----|-----|------|
| vma| new | next |
|----|-----|------|
prev
Then:
start = tmp;
So:
start end
| |
|----|-----|------|
| vma| new | next |
|----|-----|------|
prev
The following code does not cause anything to happen:
if (prev && start < prev->vm_end)
start = prev->vm_end;
if (start >= end)
break;
And then we invoke:
if (prev)
vma = find_vma(mm, prev->vm_end);
Which is where a problem occurs - we don't know about 'new' so we
essentially look for the vma after prev, which is new, whereas we actually
intended to discover next!
So we end up with:
start end
| |
|----|-----|------|
|prev| vma | next |
|----|-----|------|
And we have successfully bypassed all of the checks madvise_walk_vmas()
has to ensure early exit should we end up moving out of range.
At this point start == tmp. That is, a zero range. This is not good.
We invoke visit() which is madvise_vma_behavior() which does not check the
range (for good reason, it assumes all checks have been done before it was
called), which in turn finally calls madvise_update_vma().
The madvise_update_vma() function calls vma_modify_flags_name() in turn,
which ultimately invokes vma_modify() with... start == end.
vma_modify() calls vma_merge_existing_range() and finally we hit:
VM_WARN_ON_VMG(start >= end, vmg);
Which triggers, as start == end.
While it might be useful to add some CONFIG_DEBUG_VM asserts in these
instances to catch this kind of error, since we have just eliminated any
possibility of that happening, we will add such asserts separately as to
reduce churn and aid backporting.
Ge Yang [Wed, 19 Feb 2025 03:46:44 +0000 (11:46 +0800)]
mm/hugetlb: wait for hugetlb folios to be freed
Since the introduction of commit c77c0a8ac4c52 ("mm/hugetlb: defer freeing
of huge pages if in non-task context"), which supports deferring the
freeing of hugetlb pages, the allocation of contiguous memory through
cma_alloc() may fail probabilistically.
In the CMA allocation process, if it is found that the CMA area is
occupied by in-use hugetlb folios, these in-use hugetlb folios need to be
migrated to another location. When there are no available hugetlb folios
in the free hugetlb pool during the migration of in-use hugetlb folios,
new folios are allocated from the buddy system. A temporary state is set
on the newly allocated folio. Upon completion of the hugetlb folio
migration, the temporary state is transferred from the new folios to the
old folios. Normally, when the old folios with the temporary state are
freed, they are directly released back to the buddy system. However, due to
the deferred freeing of hugetlb pages, the PageBuddy() check fails,
ultimately leading to the failure of cma_alloc().
Here is a simplified call trace illustrating the process:
cma_alloc()
->__alloc_contig_migrate_range() // Migrate in-use hugetlb folios
->unmap_and_move_huge_page()
->folio_putback_hugetlb() // Free old folios
->test_pages_isolated()
->__test_page_isolated_in_pageblock()
->PageBuddy(page) // Check if the page is in buddy
To resolve this issue, we have implemented a function named
wait_for_freed_hugetlb_folios(). This function ensures that the hugetlb
folios are properly released back to the buddy system after their
migration is completed. By invoking wait_for_freed_hugetlb_folios()
before calling PageBuddy(), we ensure that PageBuddy() will succeed.
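A sketch of the new helper, assuming the deferred-free llist/work names
introduced by the earlier commit:

  void wait_for_freed_hugetlb_folios(void)
  {
          if (llist_empty(&hpage_freelist))
                  return;

          /* make sure free_hpage_workfn() has drained the list, so the
           * old folios are really back in the buddy allocator */
          flush_work(&free_hpage_work);
  }

  /* called by the isolation check path before it relies on PageBuddy() */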
Link: https://lkml.kernel.org/r/1739936804-18199-1-git-send-email-yangge1116@126.com Fixes: c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task context") Signed-off-by: Ge Yang <yangge1116@126.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gaoxu [Wed, 19 Feb 2025 01:56:28 +0000 (01:56 +0000)]
mm: fix possible NULL pointer dereference in __swap_duplicate
Add a NULL check on the return value of swp_swap_info in __swap_duplicate
to prevent crashes caused by NULL pointer dereference.
The reason why swp_swap_info() returns NULL is unclear; it may be due
to CPU cache issues or DDR bit flips. The probability of this issue is
very small - it has been observed to occur approximately 1 in 500,000
times per week. The stack info we encountered is as follows:
The patch seems to only provide a workaround, but there are no more
effective software solutions to handle the bit-flip problem. This patch
will change the issue from a system crash to a process exception, thereby
reducing the impact on the entire machine.
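A sketch of the added check (error handling simplified):

  si = swp_swap_info(entry);
  if (WARN_ON_ONCE(!si))
          return -EINVAL;   /* bad/corrupted entry: fail instead of crashing */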
Signed-off-by: gao xu <gaoxu2@honor.com> Link: https://lkml.kernel.org/r/e223b0e6ba2f4924984b1917cc717bd5@honor.com Reviewed-by: Barry Song <baohua@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kmsan_handle_dma() is used by virtio_ring, which can be built as a
module. kmsan_handle_dma() needs to be exported, otherwise building
virtio_ring fails.
Export kmsan_handle_dma for modules.
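The fix is essentially a one-liner; whether the export is GPL-only is a
detail of the actual patch:

  /* mm/kmsan/hooks.c */
  EXPORT_SYMBOL(kmsan_handle_dma);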
Link: https://lkml.kernel.org/r/20250218091411.MMS3wBN9@linutronix.de Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502150634.qjxwSeJR-lkp@intel.com/ Fixes: 7ade4f10779c ("dma: kmsan: unpoison DMA mappings") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Macro Elver <elver@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gwan-gyeong Mun [Mon, 17 Feb 2025 11:41:33 +0000 (13:41 +0200)]
x86/vmemmap: use direct-mapped VA instead of vmemmap-based VA
Address an Oops issue seen when testing loading of the Xe GPU driver
module after applying the GPU SVM and Xe SVM patch series[1] and the Dept
patch series[2].
The issue occurs when loading the xe driver via modprobe [3], which adds a
struct page for device memory via devm_memremap_pages(). When a process
triggers the addition of a struct page to vmemmap (e.g. hot-plug), the
page table entry for the newly added vmemmap-based virtual address is
updated first in init_mm's page table and then synchronized later.
If the vmemmap-based virtual address is accessed through the process's
page table before this sync, a page fault will occur. This patch
translates the vmemmap-based virtual address to a direct-mapped virtual
address and uses it if the current top-level page table is not init_mm's
page table when accessing a vmemmap-based virtual address before this
sync.
Link: https://lkml.kernel.org/r/20250217114133.400063-2-gwan-gyeong.mun@intel.com Fixes: faf1c0008a33 ("x86/vmemmap: optimize for consecutive sections in partial populated PMDs") Signed-off-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ma Wupeng [Mon, 17 Feb 2025 01:43:29 +0000 (09:43 +0800)]
hwpoison, memory_hotplug: lock folio before unmap hwpoisoned folio
Commit b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to
be offlined") added page poison checks in do_migrate_range() in order to
make offlining hwpoisoned pages possible, by introducing isolate_lru_page()
and try_to_unmap() for hwpoisoned pages. However, the folio lock must be
held before calling try_to_unmap(), and a warning is produced if the folio
is not locked during unmap. Add the missing lock to fix this problem.
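A minimal sketch of the locking rule being enforced (not the exact diff; the flags passed to try_to_unmap() here are an assumption):
	if (folio_test_hwpoison(folio) && folio_mapped(folio)) {
		folio_lock(folio);	/* rmap unmapping requires the folio lock */
		try_to_unmap(folio, TTU_IGNORE_MLOCK | TTU_HWPOISON);
		folio_unlock(folio);
	}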
Link: https://lkml.kernel.org/r/20250217014329.3610326-4-mawupeng1@huawei.com Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined") Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ma Wupeng [Mon, 17 Feb 2025 01:43:28 +0000 (09:43 +0800)]
mm: memory-hotplug: check folio ref count first in do_migrate_range
If a folio has an increased reference count, folio_try_get() will acquire
it, perform the necessary operations, and then release it. In the case of
a poisoned folio without an elevated reference count (which is unlikely
for memory-failure), folio_try_get() will simply bypass it.
Therefore, move the folio_try_get() call, which checks and acquires this
reference count, to the beginning.
Link: https://lkml.kernel.org/r/20250217014329.3610326-3-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
unmap_poisoned_folio(): remove shadowed local `mapping', per Miaohe
Link: https://lkml.kernel.org/r/20250219060653.3849083-1-mawupeng1@huawei.com Fixes: 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ma Wupeng [Mon, 17 Feb 2025 01:43:27 +0000 (09:43 +0800)]
mm: memory-failure: update ttu flag inside unmap_poisoned_folio
Patch series "mm: memory_failure: unmap poisoned folio during migrate
properly", v3.
Fix two bugs during folio migration if the folio is poisoned.
This patch (of 3):
Commit 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to
TTU_HWPOISON") introduced TTU_HWPOISON to replace TTU_IGNORE_HWPOISON in
order to stop sending the SIGBUS signal when accessing an error page after
a memory error on a clean folio. However, during page migration, an anon
folio must be unmapped with TTU_HWPOISON set during unmap_*(). For
pagecache folios we need a policy, just like the one in
hwpoison_user_mappings(), to set this flag. So move this policy from
hwpoison_user_mappings() to unmap_poisoned_folio() to handle this warning
properly.
A warning will be produced when unmapping a poisoned folio, with the
following log:
Link: https://lkml.kernel.org/r/20250217014329.3610326-1-mawupeng1@huawei.com Link: https://lkml.kernel.org/r/20250217014329.3610326-2-mawupeng1@huawei.com Fixes: 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Mon, 17 Feb 2025 02:49:24 +0000 (10:49 +0800)]
arm: pgtable: fix NULL pointer dereference issue
When update_mmu_cache_range() is called by update_mmu_cache(), the vmf
parameter is NULL, which will cause a NULL pointer dereference issue in
adjust_pte():
Unable to handle kernel NULL pointer dereference at virtual address 00000030 when read
Hardware name: Atmel AT91SAM9
PC is at update_mmu_cache_range+0x1e0/0x278
LR is at pte_offset_map_rw_nolock+0x18/0x2c
Call trace:
update_mmu_cache_range from remove_migration_pte+0x29c/0x2ec
remove_migration_pte from rmap_walk_file+0xcc/0x130
rmap_walk_file from remove_migration_ptes+0x90/0xa4
remove_migration_ptes from migrate_pages_batch+0x6d4/0x858
migrate_pages_batch from migrate_pages+0x188/0x488
migrate_pages from compact_zone+0x56c/0x954
compact_zone from compact_node+0x90/0xf0
compact_node from kcompactd+0x1d4/0x204
kcompactd from kthread+0x120/0x12c
kthread from ret_from_fork+0x14/0x38
Exception stack(0xc0d8bfb0 to 0xc0d8bff8)
To fix it, do not decide whether to take the pte lock based on whether the
two 'ptl's are equal; decide it based on whether CONFIG_SPLIT_PTE_PTLOCKS
is enabled. In addition, if two vmas map to the same PTE page, there is no
need to hold the pte lock again, otherwise a deadlock will occur. Just
add a need_lock parameter to let adjust_pte() know this.
Link: https://lkml.kernel.org/r/20250217024924.57996-1-zhengqi.arch@bytedance.com Fixes: fc9c45b71f43 ("arm: adjust_pte() use pte_offset_map_rw_nolock()") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reported-by: Ezra Buehler <ezra.buehler@husqvarnagroup.com> Closes: https://lore.kernel.org/lkml/CAM1KZSmZ2T_riHvay+7cKEFxoPgeVpHkVFTzVVEQ1BO0cLkHEQ@mail.gmail.com/ Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Ezra Buehler <ezra.buehler@husqvarnagroup.com> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Russel King <linux@armlinux.org.uk> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 17 Feb 2025 18:23:04 +0000 (10:23 -0800)]
selftests/damon/damos_quota_goal: handle minimum quota that cannot be further reduced
The damos_quota_goal.py selftest checks whether the DAMOS quota goals
tuning feature increases or reduces the effective size quota for a given
score as expected. The tuning feature sets the minimum quota size as one
byte, so if the effective size quota is already one, we cannot expect it
to be reduced any further. However, the test is not aware of this edge
case and fails since it sees no expected change of the effective quota.
Handle the case by updating the failure logic for "no change" to check
whether this was the case, and simply skip to the next test input.
Link: https://lkml.kernel.org/r/20250217182304.45215-1-sj@kernel.org Fixes: f1c07c0a1662 ("selftests/damon: add a test for DAMOS quota goal") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202502171423.b28a918d-lkp@intel.com Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org> Cc: <stable@vger.kernel.org> [6.10.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The general approach described in commit e076eaca5906 ("selftests: break
the dependency upon local header files") was taken one step too far here:
it should not have been extended to include the syscall numbers. This is
because doing so would require per-arch support in tools/include/uapi, and
no such support exists.
This revert fixes two separate reports of test failures, from Dave
Hansen[1], and Li Wang[2]. An excerpt of Dave's report:
Kemeng Shi [Thu, 13 Feb 2025 16:36:59 +0000 (00:36 +0800)]
test_xarray: fix failure in check_pause when CONFIG_XARRAY_MULTI is not defined
When CONFIG_XARRAY_MULTI is not defined, xa_store_order() can still store
a multi-index entry, but xas_for_each() can't tell a sibling entry from a
valid entry. So check_pause fails when we store a multi-index entry and
expect xas_for_each() to handle it normally. Avoid storing a multi-index
entry when CONFIG_XARRAY_MULTI is disabled to fix the failure.
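A hedged sketch of the guard, reusing the test's xa_store_order() helper ('xa' and 'index' are assumed from the test context, the order value is illustrative):
	/* Only exercise a multi-index (order > 0) entry when the xarray
	 * actually supports multi-index entries. */
	unsigned int order = IS_ENABLED(CONFIG_XARRAY_MULTI) ? 2 : 0;

	xa_store_order(xa, index, order, xa_mk_value(index), GFP_KERNEL);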
Link: https://lkml.kernel.org/r/20250213163659.414309-1-shikemeng@huaweicloud.com Fixes: c9ba5249ef8b ("Xarray: move forward index correctly in xas_pause()") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Closes: https://lore.kernel.org/r/CAMuHMdU_bfadUO=0OZ=AoQ9EAmQPA4wsLCBqohXR+QCeCKRn4A@mail.gmail.com Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit e30a0361b851 ("kasan: make report_lock a raw spinlock"),
report_lock was changed to raw_spinlock_t to fix another similar
PREEMPT_RT problem. That alone isn't enough to cover other corner cases.
print_address_description() is always invoked under the report_lock. The
context under this lock is always atomic even on PREEMPT_RT.
find_vm_area() acquires vmap_node::busy.lock, which is a spinlock_t that
becomes a sleeping lock on PREEMPT_RT and must not be acquired in atomic
context.
Don't invoke find_vm_area() on PREEMPT_RT and just print the address.
Non-PREEMPT_RT builds remain unchanged. Add a DEFINE_WAIT_OVERRIDE_MAP()
macro to tell lockdep that this lock nesting is allowed because the
PREEMPT_RT part (which is invalid) has been taken care of. This macro was
first introduced in commit 0cce06ba859a ("debugobjects,locking: Annotate
debug_object_fill_pool() wait type violation").
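A hedged sketch of the PREEMPT_RT special case (heavily simplified; the report text and the omitted lockdep annotation are illustrative):
	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
		/* find_vm_area() would take a sleeping lock in atomic context. */
		pr_err("The buggy address %px belongs to a vmalloc virtual mapping\n",
		       addr);
	} else {
		struct vm_struct *va = find_vm_area(addr);
		/* ... print the details from 'va' as before ... */
	}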
Link: https://lkml.kernel.org/r/20250217204402.60533-1-longman@redhat.com Fixes: e30a0361b851 ("kasan: make report_lock a raw spinlock") Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mariano Pache <npache@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Brown [Wed, 12 Feb 2025 17:44:25 +0000 (17:44 +0000)]
selftests/mm: fix check for running THP tests
When testing whether we should try to compact memory or drop caches before
running the THP or HugeTLB tests, we use | as an OR operator. This doesn't
work since run_vmtests.sh is written in shell, where | pipes the output of
the first command into the second. Instead, use the shell's -o operator.
Link: https://lkml.kernel.org/r/20250212-kselftest-mm-no-hugepages-v1-1-44702f538522@kernel.org Fixes: b433ffa8dbac ("selftests: mm: perform some system cleanup before using hugepages") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Mariano Pache <npache@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Luiz Capitulino [Tue, 11 Feb 2025 03:48:56 +0000 (22:48 -0500)]
mm: hugetlb: avoid fallback for specific node allocation of 1G pages
When using the HugeTLB kernel command-line to allocate 1G pages from a
specific node, such as:
default_hugepagesz=1G hugepages=1:1
If node 1 happens to not have enough memory for the requested number of 1G
pages, the allocation falls back to other nodes. A quick way to reproduce
this is by creating a KVM guest with a memory-less node and trying to
allocate one 1G page from it. Instead of failing, the allocation will
fall back to other nodes.
This defeats the purpose of node-specific allocation. Also, specific-node
allocation for 2M pages doesn't have this behavior: the allocation will
just fail for the pages it can't satisfy.
This issue happens because HugeTLB calls memblock_alloc_try_nid_raw() for
1G boot-time allocation as this function falls back to other nodes if the
allocation can't be satisfied. Use memblock_alloc_exact_nid_raw()
instead, which ensures that the allocation will only be satisfied from the
specified node.
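A hedged sketch of the substitution in the boot-time allocation path (simplified; the warning message is illustrative):
	/* Only satisfy the request from 'nid'; do not fall back. */
	void *p = memblock_alloc_exact_nid_raw(huge_page_size(h), huge_page_size(h),
					       0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
	if (!p)
		pr_warn("HugeTLB: not enough memory on node %d for a 1G page\n", nid);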
Link: https://lkml.kernel.org/r/20250211034856.629371-1-luizcap@redhat.com Fixes: b5389086ad7b ("hugetlbfs: extend the definition of hugepages parameter to support node allocation") Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The stress test involves CPU hotplug operations and memory control group
(memcg) operations. The scenario can be described as follows:
echo xx > memory.max cache_ap_online oom_reaper
(CPU23) (CPU50)
xx < usage stop_machine_from_inactive_cpu
for(;;) // all active cpus
trigger OOM queue_stop_cpus_work
// waiting oom_reaper
multi_cpu_stop(migration/xx)
// sync all active cpus ack
// waiting cpu23 ack
// CPU50 loops in multi_cpu_stop
waiting cpu50
Detailed explanation:
1. When the usage is larger than xx, an OOM may be triggered. If the
process does not handle the kill signal immediately, it will keep
looping in memory_max_write().
2. When cache_ap_online is triggered, multi_cpu_stop is queued to the
active cpus. Within the multi_cpu_stop function, it attempts to
synchronize the CPU states. However, CPU23 does not acknowledge
because it is stuck in the for(;;) loop.
3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
for CPU23 to acknowledge the synchronization request.
4. Finally, a cyclic dependency is formed, leading to a softlockup and a
dead loop.
To fix this issue, add cond_resched() in memory_max_write(), so that it
will not block the migration task.
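A hedged sketch of the shape of memory_max_write()'s loop with the added rescheduling point (heavily simplified; reclaim and OOM handling are elided):
	for (;;) {
		unsigned long nr_pages = page_counter_read(&memcg->memory);

		if (nr_pages <= max)
			break;

		/* ... try reclaim, then OOM kill, as before ... */

		/* Give the stopper/migration task a chance to run so the
		 * multi_cpu_stop synchronization can make progress. */
		cond_resched();
	}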
Link: https://lkml.kernel.org/r/20250211081819.33307-1-chenridong@huaweicloud.com Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage") Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Wang Weiyang <wangweiyang2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 11 Feb 2025 07:26:25 +0000 (15:26 +0800)]
mm: pgtable: fix incorrect reclaim of non-empty PTE pages
In zap_pte_range(), if the pte lock was released midway, the pte entries
may be refilled with physical pages by another thread, which may cause a
non-empty PTE page to be reclaimed and eventually cause the system to
crash.
To fix it, fall back to the slow path in this case to recheck if all pte
entries are still none.
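A hedged sketch of the recheck being described ('i' is a local introduced for illustration, 'pte' as in zap_pte_range(); the surrounding locking is omitted):
	/* Re-verify, with the pte lock held again, that the PTE page is
	 * still empty before reclaiming it. */
	for (i = 0; i < PTRS_PER_PTE; i++) {
		if (!pte_none(ptep_get(pte + i)))
			return;	/* refilled by another thread: keep the page */
	}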
Link: https://lkml.kernel.org/r/20250211072625.89188-1-zhengqi.arch@bytedance.com Fixes: 6375e95f381e ("mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reported-by: Christian Brauner <brauner@kernel.org> Closes: https://lore.kernel.org/all/20250207-anbot-bankfilialen-acce9d79a2c7@brauner/ Reported-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Closes: https://lore.kernel.org/all/152296f3-5c81-4a94-97f3-004108fba7be@gmx.com/ Tested-by: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wang Yaxin [Sat, 8 Feb 2025 06:49:01 +0000 (14:49 +0800)]
taskstats: modify taskstats version
After adding "delay max" and "delay min" to the taskstats structure, the
taskstats version needs to be updated.
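For illustration, the kind of one-line change this describes (the version number shown is illustrative; the exact value in the tree may differ):
	/* include/uapi/linux/taskstats.h: bump whenever fields are appended. */
	#define TASKSTATS_VERSION	15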
Link: https://lkml.kernel.org/r/20250208144901218Q5ptVpqsQkb2MOEmW4Ujn@zte.com.cn Fixes: f65c64f311ee ("delayacct: add delay min to record delay peak") Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn> Reviewed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wang Yaxin [Sat, 8 Feb 2025 06:44:00 +0000 (14:44 +0800)]
getdelays: fix error format characters
getdelays had a compilation issue because the format string was not
updated when the "delay min" column was added. For example, after adding
"delay min" to the printf, there were 7 column labels but only 6 "%s"
format specifiers. Similarly, after adding 't->cpu_delay_total', there
were 7 variables but only 6 format specifiers, causing the compilation
warnings below. This commit fixes these issues so that getdelays compiles
cleanly.
root@xx:~/linux-next/tools/accounting$ make
getdelays.c:199:9: warning: format `%llu' expects argument of type
`long long unsigned int', but argument 8 has type `char *' [-Wformat=]
199 | printf("\n\nCPU %15s%15s%15s%15s%15s%15s\n"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.....
216 | "delay total", "delay average", "delay max", "delay min",
| ~~~~~~~~~~~
| |
| char *
getdelays.c:200:21: note: format string is defined here
200 | " %15llu%15llu%15llu%15llu%15.3fms%13.6fms\n"
| ~~~~~^
| |
| long long unsigned int
| %15s
getdelays.c:199:9: warning: format `%f' expects argument of type
`double', but argument 12 has type `long long unsigned int' [-Wformat=]
199 | printf("\n\nCPU %15s%15s%15s%15s%15s%15s\n"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.....
220 | (unsigned long long)t->cpu_delay_total,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| long long unsigned int
.....
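A hedged sketch of the kind of fix the warnings imply for the header row, adding the seventh "%15s" (the exact label set, widths, and the matching data-row format in the tree may differ):
	printf("\n\nCPU %15s%15s%15s%15s%15s%15s%15s\n",
	       "count", "real total", "virtual total", "delay total",
	       "delay average", "delay max", "delay min");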
Link: https://lkml.kernel.org/r/20250208144400544RduNRhwIpT3m2JyRBqskZ@zte.com.cn Fixes: f65c64f311ee ("delayacct: add delay min to record delay peak") Reviewed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Peilin He <he.peilin@zte.com.cn> Cc: Qiang Tu <tu.qiang35@zte.com.cn> Cc: wangyong <wang.yong12@zte.com.cn> Cc: ye xingchen <ye.xingchen@zte.com.cn> Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize()
If migration succeeded, we called
folio_migrate_flags()->mem_cgroup_migrate() to migrate the memcg from the
old to the new folio. This will set memcg_data of the old folio to 0.
Similarly, if migration failed, memcg_data of the dst folio is left unset.
If we call folio_putback_lru() on such folios (memcg_data == 0), we will
add the folio to be freed to the LRU, making memcg code unhappy. Running
the hmm selftests:
Likely, nothing else goes wrong: putting the last folio reference will
remove the folio from the LRU again. So besides memcg complaining, adding
the folio to be freed to the LRU is just an unnecessary step.
The new flow resembles what we have in migrate_folio_move(): add the dst
to the lru, remove migration ptes, unlock and unref dst.
Link: https://lkml.kernel.org/r/20250210161317.717936-1-david@redhat.com Fixes: 8763cb45ab96 ("mm/migrate: new memory migration helper for use with device memory") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm,madvise,hugetlb: check for 0-length range after end address adjustment
Add a sanity check to madvise_dontneed_free() to address a corner case in
madvise where a race condition causes the current vma being processed to
be backed by a different page size.
During a madvise(MADV_DONTNEED) call on a memory region registered with a
userfaultfd, there's a period of time where the process mm lock is
temporarily released in order to send a UFFD_EVENT_REMOVE and let
userspace handle the event. During this time, the vma covering the
current address range may change due to an explicit mmap done concurrently
by another thread.
If, after that change, the memory region, which was originally backed by
4KB pages, is now backed by hugepages, the end address is rounded down to
a hugepage boundary to avoid data loss (see "Fixes" below). This rounding
may cause the end address to be truncated to the same address as the
start.
Make this corner case follow the same semantics as in other similar cases
where the requested region has zero length (ie. return 0).
This will make madvise_walk_vmas() continue to the next vma in the range
(this time holding the process mm lock) which, due to the prev pointer
becoming stale because of the vma change, will be the same hugepage-backed
vma that was just checked before. The next time madvise_dontneed_free()
runs for this vma, if the start address isn't aligned to a hugepage
boundary, it'll return -EINVAL, which is also in line with the madvise
api.
From userspace perspective, madvise() will return EINVAL because the start
address isn't aligned according to the new vma alignment requirements
(hugepage), even though it was correctly page-aligned when the call was
issued.
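A minimal sketch of the added sanity check in madvise_dontneed_free() (illustrative placement):
	/* The end was rounded down to a hugepage boundary and now equals
	 * the start: treat the zero-length range as a no-op. */
	if (start == end)
		return 0;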
Link: https://lkml.kernel.org/r/20250203075206.1452208-1-rcn@igalia.com Fixes: 8ebe0a5eaaeb ("mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfs") Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Florent Revest <revest@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hyeonggon Yoo [Wed, 29 Jan 2025 10:08:44 +0000 (19:08 +0900)]
mm/zswap: fix inconsistency when zswap_store_page() fails
Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
skips charging any zswap entries when it failed to zswap the entire folio.
However, when some base pages are zswapped but it failed to zswap the
entire folio, the zswap operation is rolled back. When freeing zswap
entries for those pages, zswap_entry_free() uncharges the zswap entries
that were not previously charged, causing zswap charging to become
inconsistent.
This inconsistency triggers two warnings with the following steps:
# On a machine with 64GiB of RAM and 36GiB of zswap
$ stress-ng --bigheap 2 # wait until the OOM-killer kills stress-ng
$ sudo reboot
The two warnings are:
in mm/memcontrol.c:163, function obj_cgroup_release():
WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
in mm/page_counter.c:60, function page_counter_cancel():
if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
new, nr_pages))
zswap_stored_pages also becomes inconsistent in the same way.
As suggested by Kanchana, increment zswap_stored_pages and charge zswap
entries within zswap_store_page() when it succeeds. This way,
zswap_entry_free() will decrement the counter and uncharge the entries
when it failed to zswap the entire folio.
While this could potentially be optimized by batching objcg charging and
incrementing the counter, let's focus on fixing the bug this time and
leave the optimization for later after some evaluation.
After resolving the inconsistency, the warnings disappear.
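A hedged sketch of the per-page accounting done inside zswap_store_page() on success (simplified; error paths omitted):
	/* Charge and count this page now, so a later rollback that frees the
	 * entry uncharges exactly what was charged. */
	if (objcg)
		obj_cgroup_charge_zswap(objcg, entry->length);
	atomic_long_inc(&zswap_stored_pages);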
[42.hyeyoo@gmail.com: refactor zswap_store_page()] Link: https://lkml.kernel.org/r/20250131082037.2426-1-42.hyeyoo@gmail.com Link: https://lkml.kernel.org/r/20250129100844.2935-1-42.hyeyoo@gmail.com Fixes: b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()") Co-developed-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
import_iovec() says that it should always be fine to kfree the iovec
returned in @iovp regardless of the error code. __import_iovec_ubuf()
never reallocates it and thus should clear the pointer even in cases when
copy_iovec_*() fail.
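A hedged sketch of the guarantee from a caller's perspective ('uvec' and 'nr_segs' are assumed caller inputs; the local names are illustrative):
	struct iovec fast[UIO_FASTIOV], *iov = fast;
	struct iov_iter iter;
	ssize_t ret;

	ret = import_iovec(ITER_DEST, uvec, nr_segs, ARRAY_SIZE(fast), &iov, &iter);
	/* Whether ret is an error or not, iov either points to a real
	 * allocation or is NULL, so this kfree() is always safe. */
	kfree(iov);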
Link: https://lkml.kernel.org/r/378ae26923ffc20fd5e41b4360d673bf47b1775b.1738332461.git.asml.silence@gmail.com Fixes: 3b2deb0e46da ("iov_iter: import single vector iovecs as ITER_UBUF") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bart Van Assche [Wed, 29 Jan 2025 22:20:03 +0000 (14:20 -0800)]
procfs: fix a locking bug in a vmcore_add_device_dump() error path
Unlock vmcore_mutex when returning -EBUSY.
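A minimal sketch of the error-path fix (simplified; the variable names follow the cited commit and are assumptions here):
	mutex_lock(&vmcore_mutex);
	if (vmcore_opened) {
		ret = -EBUSY;
		goto out_unlock;	/* previously returned without unlocking */
	}
	/* ... register the device dump ... */
out_unlock:
	mutex_unlock(&vmcore_mutex);
	return ret;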
Link: https://lkml.kernel.org/r/20250129222003.1495713-1-bvanassche@acm.org Fixes: 0f3b1c40c652 ("fs/proc/vmcore: disallow vmcore modifications while the vmcore is open") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baoquan he <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Sun, 16 Feb 2025 20:58:51 +0000 (12:58 -0800)]
Merge tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix annoying logs when building tools in parallel
- Fix the Debian linux-headers package build again
- Fix the target triple detection for userspace programs on Clang
* tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
modpost: Fix a few typos in a comment
kbuild: userprogs: fix bitsize and target detection on clang
kbuild: fix linux-headers package build when $(CC) cannot link userspace
tools: fix annoying "mkdir -p ..." logs when building tools in parallel
Linus Torvalds [Sun, 16 Feb 2025 20:54:42 +0000 (12:54 -0800)]
Merge tag 'driver-core-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core api addition from Greg KH:
"Here is a driver core new api for 6.14-rc3 that is being added to
help stop platform devices from being abused.
It adds a new 'faux_device' structure and bus and api to allow almost
a straight or simpler conversion from platform devices that were not
really a platform device. It also comes with a binding for rust, with
an example driver in rust showing how it's used.
I'm adding this now so that the patches that convert the different
drivers and subsystems can all start flowing into linux-next now
through their different development trees, in time for 6.15-rc1.
We have a number that are already reviewed and tested, but adding
those conversions now doesn't seem right. For now, no one is using
this, and it passes all build tests from 0-day and linux-next, so all
should be good"
* tag 'driver-core-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
rust/kernel: Add faux device bindings
driver core: add a faux bus for use when a simple device/bus is needed
Linus Torvalds [Sun, 16 Feb 2025 19:15:50 +0000 (11:15 -0800)]
Merge tag 'usb-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes, and new device ids, for
6.14-rc3. Lots of tiny stuff for reported problems, including:
- new device ids and quirks
- usb hub crash fix found by syzbot
- dwc2 driver fix
- dwc3 driver fixes
- uvc gadget driver fix
- cdc-acm driver fixes for a variety of different issues
- other tiny bugfixes
Almost all of these have been in linux-next this week, and all have
passed 0-day testing"
* tag 'usb-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (25 commits)
usb: typec: tcpm: PSSourceOffTimer timeout in PR_Swap enters ERROR_RECOVERY
usb: roles: set switch registered flag early on
usb: gadget: uvc: Fix unstarted kthread worker
USB: quirks: add USB_QUIRK_NO_LPM quirk for Teclast dist
usb: gadget: core: flush gadget workqueue after device removal
USB: gadget: f_midi: f_midi_complete to call queue_work
usb: core: fix pipe creation for get_bMaxPacketSize0
usb: dwc3: Fix timeout issue during controller enter/exit from halt state
USB: Add USB_QUIRK_NO_LPM quirk for sony xperia xz1 smartphone
USB: cdc-acm: Fill in Renesas R-Car D3 USB Download mode quirk
usb: cdc-acm: Fix handling of oversized fragments
usb: cdc-acm: Check control transfer buffer size before access
usb: xhci: Restore xhci_pci support for Renesas HCs
USB: pci-quirks: Fix HCCPARAMS register error for LS7A EHCI
USB: serial: option: drop MeiG Smart defines
USB: serial: option: fix Telit Cinterion FN990A name
USB: serial: option: add Telit Cinterion FN990B compositions
USB: serial: option: add MeiG Smart SLM828
usb: gadget: f_midi: fix MIDI Streaming descriptor lengths
usb: dwc2: gadget: remove of_node reference upon udc_stop
...
Linus Torvalds [Sun, 16 Feb 2025 18:41:50 +0000 (10:41 -0800)]
Merge tag 'perf_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Borislav Petkov:
- Explicitly clear DEBUGCTL.LBR to prevent LBRs continuing being
enabled after handoff to the OS
- Check CPUID(0x23) leaf and subleafs presence properly
- Remove the PEBS-via-PT feature from being supported on hybrid systems
- Fix perf record/top default commands on systems without a raw PMU
registered
* tag 'perf_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Ensure LBRs are disabled when a CPU is starting
perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
perf/x86/intel: Clean up PEBS-via-PT on hybrid
perf/x86/rapl: Fix the error checking order
Linus Torvalds [Sun, 16 Feb 2025 18:25:12 +0000 (10:25 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Large set of fixes for vector handling, especially in the
interactions between host and guest state.
This fixes a number of bugs affecting actual deployments, and
greatly simplifies the FP/SIMD/SVE handling. Thanks to Mark Rutland
for dealing with this thankless task.
- Fix an ugly race between vcpu and vgic creation/init, resulting in
unexpected behaviours
- Fix use of kernel VAs at EL2 when emulating timers with nVHE
- Small set of pKVM improvements and cleanups
x86:
- Fix broken SNP support with KVM module built-in, ensuring the PSP
module is initialized before KVM even when the module
infrastructure cannot be used to order initcalls
- Reject Hyper-V SEND_IPI hypercalls if the local APIC isn't being
emulated by KVM to fix a NULL pointer dereference
- Enter guest mode (L2) from KVM's perspective before initializing
the vCPU's nested NPT MMU so that the MMU is properly tagged for
L2, not L1
- Load the guest's DR6 outside of the innermost .vcpu_run() loop, as
the guest's value may be stale if a VM-Exit is handled in the
fastpath"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
x86/sev: Fix broken SNP support with KVM module built-in
KVM: SVM: Ensure PSP module is initialized if KVM module is built-in
crypto: ccp: Add external API interface for PSP module initialization
KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()
KVM: arm64: timer: Drop warning on failed interrupt signalling
KVM: arm64: Fix alignment of kvm_hyp_memcache allocations
KVM: arm64: Convert timer offset VA when accessed in HYP code
KVM: arm64: Simplify warning in kvm_arch_vcpu_load_fp()
KVM: arm64: Eagerly switch ZCR_EL{1,2}
KVM: arm64: Mark some header functions as inline
KVM: arm64: Refactor exit handlers
KVM: arm64: Refactor CPTR trap deactivation
KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN
KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN
KVM: arm64: Remove host FPSIMD saving for non-protected KVM
KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state
KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop
KVM: nSVM: Enter guest mode before initializing nested NPT MMU
KVM: selftests: Add CPUID tests for Hyper-V features that need in-kernel APIC
KVM: selftests: Manage CPUID array in Hyper-V CPUID test's core helper
...
Linus Torvalds [Sun, 16 Feb 2025 18:19:41 +0000 (10:19 -0800)]
Merge tag 'mips-fixes_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:
"Fix for o32 ptrace/get_syscall_info"
* tag 'mips-fixes_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: fix mips_get_syscall_arg() for o32
MIPS: Export syscall stack arguments properly for remote use
Linus Torvalds [Sun, 16 Feb 2025 01:14:53 +0000 (17:14 -0800)]
Merge tag 'uml-for-linus-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux
Pull UML fixes from Richard Weinberger:
- Align signal stack correctly
- Convert to raw spinlocks where needed (irq and virtio)
- FPU related fixes
* tag 'uml-for-linus-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
um: convert irq_lock to raw spinlock
um: virtio_uml: use raw spinlock
um: virt-pci: don't use kmalloc()
um: fix execve stub execution on old host OSs
um: properly align signal stack on x86_64
um: avoid copying FP state from init_task
um: add back support for FXSAVE registers
Linus Torvalds [Sun, 16 Feb 2025 00:34:41 +0000 (16:34 -0800)]
Merge tag 'trace-ring-buffer-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull trace ring buffer fixes from Steven Rostedt:
- Enable resize on mmap() error
When a process mmaps a ring buffer, its size is locked and resizing
is disabled. But if the user passes in a wrong parameter, the mmap()
can fail after the resize was disabled and the mmap() exits with
error without reenabling the ring buffer resize. This prevents the
ring buffer from ever being resized after that. Reenable resizing of
the ring buffer on mmap() error.
- Have resizing return proper error and not always -ENOMEM
If the ring buffer is mmapped by one task and another task tries to
resize the buffer it will error with -ENOMEM. This is confusing to
the user as there may be plenty of memory available. Have it return
the error that actually happens (in this case -EBUSY) where the user
can understand why the resize failed.
- Test the sub-buffer array to validate persistent memory buffer
On boot up, the initialization of the persistent memory buffer will
do a validation check to see if the content of the data is valid, and
if so, it will use the memory as is, otherwise it re-initializes it.
There's meta data in this persistent memory that keeps track of which
sub-buffer is the reader page and an array that states the order of
the sub-buffers. The values in this array are indexes into the
sub-buffers. The validator checks to make sure that all the entries
in the array are within the sub-buffer list index, but it does not
check for duplications.
While working on this code, the array got corrupted and had
duplicates, where not all the sub-buffers were accounted for. This
passed the validator as all entries were valid, but the link list was
incorrect and could have caused a crash. The corruption only produced
incorrect data, but it could have been more severe. To fix this,
create a bitmask that covers all the sub-buffer indexes and set it to
all zeros. While iterating the array checking the values of the array
content, have it set a bit corresponding to the index in the array.
If the bit was already set, then it is a duplicate and mark the
buffer as invalid and reset it.
- Prevent mmap()ing persistent ring buffer
The persistent ring buffer uses vmap() to map the persistent memory.
Currently, the mmap() logic only uses virt_to_page() to get the page
from the ring buffer memory and use that to map to user space. This
works because a normal ring buffer uses alloc_page() to allocate its
memory. But because the persistent ring buffer use vmap() it causes a
kernel crash.
Fixing this to work with vmap() is not hard, but since mmap() on
persistent memory buffers never worked, just have mmap() return -ENODEV
(what was returned before mmap() existed for persistent memory ring
buffers, as they never supported mmap). Normal buffers will still
allow mmap(). Implementing mmap() for persistent memory ring buffers
can wait till the next merge window.
- Fix polling on persistent ring buffers
There's a "buffer_percent" option (default set to 50), that is used
to have reads of the ring buffer binary data block until the buffer
fills to that percentage. The field "pages_touched" is incremented
every time a new sub-buffer has content added to it. This field is
used in the calculations to determine the amount of content is in the
buffer and if it exceeds the "buffer_percent" then it will wake the
task polling on the buffer.
As persistent ring buffers can be created by the content from a
previous boot, the "pages_touched" field was not updated. This means
that if a task were to poll on the persistent buffer, it would block
even if the buffer was completely full. It would block even if the
"buffer_percent" was zero, because with "pages_touched" as zero, it
would be calculated as the buffer having no content. Update
pages_touched when initializing the persistent ring buffer from a
previous boot.
* tag 'trace-ring-buffer-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Update pages_touched to reflect persistent buffer content
tracing: Do not allow mmap() of persistent ring buffer
ring-buffer: Validate the persistent meta data subbuf array
tracing: Have the error of __tracing_resize_ring_buffer() passed to user
ring-buffer: Unlock resize on mmap error
Steven Rostedt [Fri, 14 Feb 2025 17:35:12 +0000 (12:35 -0500)]
ring-buffer: Update pages_touched to reflect persistent buffer content
The pages_touched field represents the number of subbuffers in the ring
buffer that have content that can be read. This is used in accounting of
"dirty_pages" and "buffer_percent" to allow the user to wait for the
buffer to be filled to a certain amount before it reads the buffer in
blocking mode.
The persistent buffer never updated this value so it was set to zero, and
this accounting would take it as it had no content. This would cause user
space to wait for content even though there's enough content in the ring
buffer that satisfies the buffer_percent.
Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250214123512.0631436e@gandalf.local.home Fixes: 5f3b6e839f3ce ("ring-buffer: Validate boot range memory events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
But virt_to_page() does not work with vmap()'d memory which is what the
persistent ring buffer has. It is rather trivial to allow this, but for
now just disable mmap() of instances that have their ring buffer from the
reserve_mem option.
If an mmap() is performed on a persistent buffer it will return -ENODEV
just like it would if the .mmap field wasn't defined in the
file_operations structure.
Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250214115547.0d7287d3@gandalf.local.home Fixes: 9b7bdf6f6ece6 ("tracing: Have trace_printk not use binary prints if boot buffer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Linus Torvalds [Sat, 15 Feb 2025 18:20:47 +0000 (10:20 -0800)]
Merge tag 'i2c-for-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"MAINTAINERS maintenance.
Changed email, added entry, deleted entry falling back to a generic
one"
* tag 'i2c-for-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: Add maintainer for Qualcomm's I2C GENI driver
MAINTAINERS: delete entry for AXXIA I2C
MAINTAINERS: Use my kernel.org address for I2C ACPI work
Linus Torvalds [Sat, 15 Feb 2025 18:15:24 +0000 (10:15 -0800)]
Merge tag 's390-6.14-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix isolated VFs handling by verifying that a VF’s parent PF is
locally owned before registering it in an existing PCI domain
- Disable arch_test_bit() optimization for PROFILE_ALL_BRANCHES to
workaround gcc failure in handling __builtin_constant_p() in this
case
- Fix CHPID "configure" attribute caching in CIO by not updating the
cache when SCLP returns no data, ensuring consistent sysfs output
- Remove CONFIG_LSM from default configs and rely on defaults, which
enables BPF LSM hook
* tag 's390-6.14-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pci: Fix handling of isolated VFs
s390/pci: Pull search for parent PF out of zpci_iov_setup_virtfn()
s390/bitops: Disable arch_test_bit() optimization for PROFILE_ALL_BRANCHES
s390/cio: Fix CHPID "configure" attribute caching
s390/configs: Remove CONFIG_LSM
Thomas Weißschuh [Thu, 13 Feb 2025 14:55:17 +0000 (15:55 +0100)]
kbuild: userprogs: fix bitsize and target detection on clang
scripts/Makefile.clang was changed in the linked commit to move --target from
KBUILD_CFLAGS to KBUILD_CPPFLAGS, as that generally has a broader scope.
However that variable is not inspected by the userprogs logic,
breaking cross compilation on clang.
Use both variables to detect bitsize and target arguments for userprogs.
Linus Torvalds [Sat, 15 Feb 2025 17:54:46 +0000 (09:54 -0800)]
Merge tag 'rust-fixes-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux
Pull rust fixes from Miguel Ojeda:
- Fix objtool warning due to future Rust 1.85.0 (to be released in a
few days)
- Clean future Rust 1.86.0 (to be released 2025-04-03) Clippy warning
* tag 'rust-fixes-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: rbtree: fix overindented list item
objtool/rust: add one more `noreturn` Rust function
which causes problems on 32-bit x86 configurations that have 64-bit
resource sizes:
x86_64-linux-ld: drivers/dma/tegra210-adma.o: in function `tegra_adma_probe':
tegra210-adma.c:(.text+0x1322): undefined reference to `__udivdi3'
because gcc doesn't generate the trivial code for a 64-by-32 divide,
turning it into a function call to do a full 64-by-64 divide. And the
kernel intentionally doesn't provide that helper function, because 99%
of the time all you want is the narrower version.
Of course, tegra210 is a 64-bit architecture and the 32-bit x86 build is
purely for build testing, so this really is just about build coverage
failure.
But build coverage is good.
Side note: div_u64() would be suboptimal if you actually have a 32-bit
resource_t, so our "helpers" for divides admittedly make it harder
than it should be to generate good code for all the possible cases.
At some point, I'll consider 32-bit x86 so entirely legacy that I can't
find it in myself to care any more, and we'll just add the __udivdi3
library function.
But for now, the right thing to do is to use "div_u64()" to show that
you know that you are doing the simpler divide with a 32-bit number.
And the build error enforces that.
While fixing the build issue, also check for division-by-zero, and for
overflow. Which hopefully cannot happen on real production hardware,
but the value of 'ch_base_offset' can definitely be zero in other
places.
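A hedged sketch of the pattern described above, applied to the adma probe computation ('nr_channels' and the exact placement are illustrative):
	/* 64-by-32 divide: no libgcc __udivdi3 needed on 32-bit builds. */
	if (!ch_base_offset)
		return -EINVAL;
	nr_channels = div_u64(resource_size(res), ch_base_offset);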
Linus Torvalds [Sat, 15 Feb 2025 16:13:45 +0000 (08:13 -0800)]
Merge tag 'gpio-fixes-for-v6.14-rc3-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix interrupt handling issues in gpio-bcm-kona
- add an ACPI quirk for Acer Nitro ANV14 fixing an issue with spurious
wake up events
- add missing return value checks to gpio-stmpe
- fix a crash in error path in gpiochip_get_ngpios()
* tag 'gpio-fixes-for-v6.14-rc3-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: Fix crash on error in gpiochip_get_ngpios()
gpio: stmpe: Check return value of stmpe_reg_read in stmpe_gpio_irq_sync_unlock
gpiolib: acpi: Add a quirk for Acer Nitro ANV14
gpio: bcm-kona: Add missing newline to dev_err format string
gpio: bcm-kona: Make sure GPIO bits are unlocked when requesting IRQ
gpio: bcm-kona: Fix GPIO lock/unlock for banks above bank 0
Masahiro Yamada [Thu, 13 Feb 2025 06:26:44 +0000 (15:26 +0900)]
kbuild: fix linux-headers package build when $(CC) cannot link userspace
Since commit 5f73e7d0386d ("kbuild: refactor cross-compiling
linux-headers package"), the linux-headers Debian package fails to
build when $(CC) cannot build userspace applications, for example,
when using toolchains installed by the 0day bot.
The host programs in the linux-headers package should be rebuilt using
the distro's cross-compiler, ${DEB_HOST_GNU_TYPE}-gcc, instead of $(CC).
Hence, the variable 'CC' must be expanded in this shell script instead
of in the top-level Makefile.
Commit f354fc88a72a ("kbuild: install-extmod-build: add missing
quotation marks for CC variable") was not a correct fix because
CC="ccache gcc" should be unrelated when rebuilding userspace tools.
Fixes: 5f73e7d0386d ("kbuild: refactor cross-compiling linux-headers package") Reported-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com> Closes: https://lore.kernel.org/linux-kbuild/CAK7LNARb3xO3ptBWOMpwKcyf3=zkfhMey5H2KnB1dOmUwM79dA@mail.gmail.com/T/#t Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Tested-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
Masahiro Yamada [Tue, 11 Feb 2025 00:29:06 +0000 (09:29 +0900)]
tools: fix annoying "mkdir -p ..." logs when building tools in parallel
When CONFIG_OBJTOOL=y or CONFIG_DEBUG_INFO_BTF=y, parallel builds
show awkward "mkdir -p ..." logs.
$ make -j16
[ snip ]
mkdir -p /home/masahiro/ref/linux/tools/objtool && make O=/home/masahiro/ref/linux subdir=tools/objtool --no-print-directory -C objtool
mkdir -p /home/masahiro/ref/linux/tools/bpf/resolve_btfids && make O=/home/masahiro/ref/linux subdir=tools/bpf/resolve_btfids --no-print-directory -C bpf/resolve_btfids
Defining MAKEFLAGS=<value> on the command line wipes out command line
switches from the resultant MAKEFLAGS definition, even though the command
line switches are active. [1]
MAKEFLAGS puts all single-letter options into the first word, and that
word will be empty if no single-letter options were given. [2]
However, this breaks if MAKEFLAGS=<value> is given on the command line.
The tools/ and tools/% targets set MAKEFLAGS=<value> on the command
line, which breaks the following code in tools/scripts/Makefile.include:
short-opts := $(firstword -$(MAKEFLAGS))
If MAKEFLAGS really needs modification, it should be done through the
environment variable, as follows:
MAKEFLAGS=<value> $(MAKE) ...
That said, I question whether modifying MAKEFLAGS is necessary here.
The only flag we might want to exclude is --no-print-directory, as the
tools build system changes the working directory. However, people might
find the "Entering/Leaving directory" logs annoying.
Linus Torvalds [Sat, 15 Feb 2025 03:56:12 +0000 (19:56 -0800)]
Merge tag 'alpha-fixes-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha
Pull alpha fixes from Matt Turner:
"A few changes for alpha, including some important fixes for kernel
stack alignment"
* tag 'alpha-fixes-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha:
alpha: Use str_yes_no() helper in pci_dac_dma_supported()
alpha: Replace one-element array with flexible array member
alpha: align stack for page fault and user unaligned trap handlers
alpha: make stack 16-byte aligned (most cases)
alpha: replace hardcoded stack offsets with autogenerated ones
Linus Torvalds [Sat, 15 Feb 2025 00:49:07 +0000 (16:49 -0800)]
Merge tag 'pci-v6.14-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Update a BUILD_BUG_ON() usage that works on current compilers, but
breaks compilation on gcc 5.3.1 (Alex Williamson)
- Avoid use of FLR for Mediatek MT7922 WiFi; the device previously
worked after a long timeout and fallback to SBR, but after a recent
RRS change it doesn't work at all after FLR (Bjorn Helgaas)
* tag 'pci-v6.14-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: Avoid FLR for Mediatek MT7922 WiFi
PCI: Fix BUILD_BUG_ON usage for old gcc