Zqiang [Mon, 17 Nov 2025 12:53:11 +0000 (20:53 +0800)]
sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object
The free_kick_syncs_rcu() rcu-callback only invoke kvfree() to
release per-cpu ksyncs object, this can use kvfree_rcu() replace
call_rcu() to release per-cpu ksyncs object in the free_kick_syncs().
Tejun Heo [Fri, 14 Nov 2025 01:33:41 +0000 (15:33 -1000)]
sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs
With the buddy lockup detector, smp_processor_id() returns the detecting CPU,
not the locked CPU, making scx_hardlockup()'s printouts confusing. Pass the
locked CPU number from watchdog_hardlockup_check() as a parameter instead.
Also add kerneldoc comments to handle_lockup(), scx_hardlockup(), and
scx_rcu_cpu_stall() documenting their return value semantics.
Suggested-by: Doug Anderson <dianders@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Acked-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Andrea Righi [Wed, 12 Nov 2025 17:29:45 +0000 (18:29 +0100)]
sched_ext: Update comments replacing breather with aborting mechanism
Commit 5ebec443fb96a ("sched_ext: Exit dispatch and move operations
immediately when aborting") replaced the breather mechanism with the
scx_aborting flag.
Update comments removing references to the breather mechanism to avoid
confusion.
Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:16 +0000 (09:18 -1000)]
sched_ext: Implement load balancer for bypass mode
In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well
in most cases, there is a failure mode where a BPF scheduler can skew task
placement severely before triggering bypass in highly over-saturated systems.
If most tasks end up concentrated on a few CPUs, those CPUs can accumulate
queues that are too long to drain in a reasonable time, leading to RCU stalls
and hung tasks.
Implement a simple timer-based load balancer that redistributes tasks across
CPUs within each NUMA node. The balancer runs periodically (default 500ms,
tunable via bypass_lb_intv_us module parameter) and moves tasks from overloaded
CPUs to underloaded ones.
When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks
to avoid dropping and reacquiring the donor DSQ lock on each iteration, as
donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag
and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load
balancer timer function reads scx_bypass_depth locklessly to check whether
bypass mode is active. Use WRITE_ONCE() when updating scx_bypass_depth to pair
with the READ_ONCE() in the timer function.
This has been tested on a 192 CPU dual socket AMD EPYC machine with ~20k
runnable tasks running scx_cpu0. As scx_cpu0 queues all tasks to CPU0, almost
all tasks end up on CPU0 creating severe imbalance. Without the load balancer,
disabling the scheduler can lead to RCU stalls and hung tasks, taking a very
long time to complete. With the load balancer, disable completes in about a
second.
The load balancing operation can be monitored using the sched_ext_bypass_lb
tracepoint and disabled by setting bypass_lb_intv_us to 0.
v2: Lock both rq and DSQ in bypass_lb_cpu() and use dispatch_dequeue_locked()
to prevent races with dispatch_dequeue() (Andrea Righi).
Cc: Andrea Righi <arighi@nvidia.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com>
Reviewed_by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:15 +0000 (09:18 -1000)]
sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked()
move_task_between_dsqs() contains open-coded abbreviated dequeue logic when
moving tasks between non-local DSQs. Factor this out into
dispatch_dequeue_locked() which can be used when both the task's rq and dsq
locks are already held. Add lockdep assertions to both dispatch_dequeue() and
the new helper to verify locking requirements.
This prepares for the load balancer which will need the same abbreviated
dequeue pattern.
Cc: Andrea Righi <arighi@nvidia.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:13 +0000 (09:18 -1000)]
sched_ext: Add scx_cpu0 example scheduler
Add scx_cpu0, a simple scheduler that queues all tasks to a single DSQ and
only dispatches them from CPU0 in FIFO order. This is useful for testing bypass
behavior when many tasks are concentrated on a single CPU. If the load balancer
doesn't work, bypass mode can trigger task hangs or RCU stalls as the queue is
long and there's only one CPU working on it.
v2: Check whether task is on CPU0 at enqueue using scx_bpf_task_cpu() instead
of nr_cpus_allowed (Andrea Righi).
Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:12 +0000 (09:18 -1000)]
sched_ext: Hook up hardlockup detector
A poorly behaving BPF scheduler can trigger hard lockup. For example, on a
large system with many tasks pinned to different subsets of CPUs, if the BPF
scheduler puts all tasks in a single DSQ and lets all CPUs at it, the DSQ lock
can be contended to the point where hardlockup triggers. Unfortunately,
hardlockup can be the first signal out of such situations, thus requiring
hardlockup handling.
Hook scx_hardlockup() into the hardlockup detector to try kicking out the
current scheduler in an attempt to recover the system to a good state. The
handling strategy can delay watchdog taking its own action by one polling
period; however, given that the only remediation for hardlockup is crash, this
is likely an acceptable trade-off.
v2: Add missing dummy scx_hardlockup() definition for
!CONFIG_SCHED_CLASS_EXT (kernel test bot).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Cc: Douglas Anderson <dianders@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:11 +0000 (09:18 -1000)]
sched_ext: Make handle_lockup() propagate scx_verror() result
handle_lockup() currently calls scx_verror() but ignores its return value,
always returning true when the scheduler is enabled. Make it capture and return
the result from scx_verror(). This prepares for hardlockup handling.
Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:10 +0000 (09:18 -1000)]
sched_ext: Refactor lockup handlers into handle_lockup()
scx_rcu_cpu_stall() and scx_softlockup() share the same pattern: check if the
scheduler is enabled under RCU read lock and trigger an error if so. Extract
the common pattern into handle_lockup() helper. Add scx_verror() macro and use
guard(rcu)().
This simplifies both handlers, reduces code duplication, and prepares for
hardlockup handling.
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:09 +0000 (09:18 -1000)]
sched_ext: Make scx_exit() and scx_vexit() return bool
Make scx_exit() and scx_vexit() return bool indicating whether the calling
thread successfully claimed the exit. This will be used by the abort mechanism
added in a later patch.
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:08 +0000 (09:18 -1000)]
sched_ext: Exit dispatch and move operations immediately when aborting
62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") introduced
the breather mechanism to inject delays during bypass mode switching. It
maintains operation semantics unchanged while reducing lock contention to avoid
live-locks on large NUMA systems.
However, the breather only activates when exiting the scheduler, so there's no
need to maintain operation semantics. Simplify by exiting dispatch and move
operations immediately when scx_aborting is set. In consume_dispatch_q(), break
out of the task iteration loop. In scx_dsq_move(), return early before
acquiring locks.
This also fixes cases the breather mechanism cannot handle. When a large system
has many runnable threads affinitized to different CPU subsets and the BPF
scheduler places them all into a single DSQ, many CPUs can scan the DSQ
concurrently for tasks they can run. This can cause DSQ and RQ locks to be held
for extended periods, leading to various failure modes. The breather cannot
solve this because once in the consume loop, there's no exit. The new mechanism
fixes this by exiting the loop immediately.
The bypass DSQ is exempted to ensure the bypass mechanism itself can make
progress.
v2: Use READ_ONCE() when reading scx_aborting (Andrea Righi).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Andrea Righi <arighi@nvidia.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:07 +0000 (09:18 -1000)]
sched_ext: Simplify breather mechanism with scx_aborting flag
The breather mechanism was introduced in 62dcbab8b0ef ("sched_ext: Avoid
live-locking bypass mode switching") and e32c260195e6 ("sched_ext: Enable the
ops breather and eject BPF scheduler on softlockup") to prevent live-locks by
injecting delays when CPUs are trapped in dispatch paths.
Currently, it uses scx_breather_depth (atomic_t) and scx_in_softlockup
(unsigned long) with separate increment/decrement and cleanup operations. The
breather is only activated when aborting, so tie it directly to the exit
mechanism. Replace both variables with scx_aborting flag set when exit is
claimed and cleared after bypass is enabled. Introduce scx_claim_exit() to
consolidate exit_kind claiming and breather enablement. This eliminates
scx_clear_softlockup() and simplifies scx_softlockup() and scx_bypass().
The breather mechanism will be replaced by a different abort mechanism in a
future patch. This simplification prepares for that change.
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:06 +0000 (09:18 -1000)]
sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
Bypass mode routes tasks through fallback dispatch queues. Originally a single
global DSQ, b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node")
changed this to per-node DSQs to resolve NUMA-related livelocks.
Dan Schatzberg found per-node DSQs can still livelock when many threads are
pinned to different small CPU subsets: each CPU must scan many incompatible
tasks to find runnable ones, causing severe contention with high CPU counts.
Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default
idle CPU selection and direct dispatch handle most cases well.
This introduces a failure mode when tasks concentrate on one CPU in
over-saturated systems. If the BPF scheduler severely skews placement before
triggering bypass, that CPU's queue may be too long to drain, causing RCU
stalls. A load balancer in a future patch will address this. The bypass DSQ is
separate from local DSQ to enable load balancing: local DSQs use rq locks,
preventing efficient scanning and transfer across CPUs, especially problematic
when systems are already contended.
v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:05 +0000 (09:18 -1000)]
sched_ext: Refactor do_enqueue_task() local and global DSQ paths
The local and global DSQ enqueue paths in do_enqueue_task() share the same
slice refill logic. Factor out the common code into a shared enqueue label.
This makes adding new enqueue cases easier. No functional changes.
Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 11 Nov 2025 19:18:04 +0000 (09:18 -1000)]
sched_ext: Use shorter slice in bypass mode
There have been reported cases of bypass mode not making forward progress fast
enough. The 20ms default slice is unnecessarily long for bypass mode where the
primary goal is ensuring all tasks can make forward progress.
Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically
switch to it when entering bypass mode. Also make the bypass slice value
tunable through the slice_bypass_us module parameter (adjustable between 100us
and 100ms) to make it easier to test whether slice durations are a factor in
problem cases.
v3: Use READ_ONCE/WRITE_ONCE for scx_slice_dfl access (Dan).
Tejun Heo [Wed, 5 Nov 2025 22:03:08 +0000 (12:03 -1000)]
sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races
The warned bitfields in struct scx_sched are updated racily from concurrent
CPUs causing RMW races, which is fine for these boolean warning flags. Add a
comment marking this area to prevent future fields that can't tolerate racy
updates from being added here.
Tejun Heo [Tue, 4 Nov 2025 21:42:55 +0000 (11:42 -1000)]
sched_ext: Minor cleanups to scx_task_iter
- Use memset() in scx_task_iter_start() instead of zeroing fields individually.
- In scx_task_iter_next(), move __scx_task_iter_maybe_relock() after the batch
check which is simpler.
- Update comment to reflect that tasks are removed from scx_tasks when dead
(commit 7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving
sched_ext_free() to finish_task_switch()")).
Tejun Heo [Tue, 4 Nov 2025 21:40:22 +0000 (11:40 -1000)]
sched_ext: Move __SCX_DSQ_ITER_ALL_FLAGS BUILD_BUG_ON to the right place
The BUILD_BUG_ON() which checks that __SCX_DSQ_ITER_ALL_FLAGS doesn't
overlap with the private lnode bits was in scx_task_iter_start() which has
nothing to do with DSQ iteration. Move it to bpf_iter_scx_dsq_new() where it
belongs.
Tejun Heo [Mon, 3 Nov 2025 20:25:13 +0000 (10:25 -1000)]
sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:
- ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
during scheduler load, because the cgroup might be destroyed/unlinked
while the zombie or dead task is still lingering on the scx_tasks list.
- ops.cgroup_exit() could be called before ops.exit_task() is called on all
member tasks, leading to incorrect exit ordering.
Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class->task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.
By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
as cgroup_mutex prevents cgroup destruction while the task is still linked.
- All member tasks have ops.exit_task() called and are removed from scx_tasks
before the cgroup can be destroyed and trigger ops.cgroup_exit().
This fix is made possible by the cgroup_task_dead() split in the previous patch.
This also makes more sense resource-wise as there's no point in keeping
scheduler side resources around for dead tasks.
Reported-by: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Mon, 3 Nov 2025 21:57:11 +0000 (11:57 -1000)]
sched_ext: Merge branch 'for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-6.19
Pull cgroup/for-6.19 to receive:
16dad7801aad ("cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()") 260fbcb92bbe ("cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()") d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
These are needed for the sched_ext cgroup exit ordering fix.
Tejun Heo [Wed, 29 Oct 2025 06:19:17 +0000 (20:19 -1000)]
cgroup: Defer task cgroup unlink until after the task is done switching out
When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task
from its cgroup. From the cgroup's perspective, the task is now gone. If this
makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks
that notify controllers the cgroup is going offline resource-wise.
However, the exiting task can still run, perform memory operations, and schedule
until the final context switch in finish_task_switch(). This creates a confusing
situation where controllers are told a cgroup is offline while resource
activities are still happening in it. While this hasn't broken existing
controllers, it has caused direct confusion for sched_ext schedulers.
Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls
the subsystem exit callbacks and continues to be called from do_exit(). The
css_set cleanup is moved to the new cgroup_task_dead() which is called from
finish_task_switch() after the final context switch, so that the cgroup only
appears empty after the task is truly done running.
This also reorders operations so that subsys->exit() is now called before
unlinking from the cgroup, which shouldn't break anything.
Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Wed, 29 Oct 2025 06:19:16 +0000 (20:19 -1000)]
cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
Currently, cgroup_task_exit() adds thread group leaders with live member
threads to their css_set's dying_tasks list (so cgroup.procs iteration can
still see the leader), and cgroup_task_release() later removes them with
list_del_init(&task->cg_list).
An upcoming patch will defer the dying_tasks list addition, moving it from
cgroup_task_exit() (called from do_exit()) to a new function called from
finish_task_switch(). However, release_task() (which calls
cgroup_task_release()) can run either before or after finish_task_switch(),
creating a race where cgroup_task_release() might try to remove the task from
dying_tasks before or while it's being added.
Move the list_del_init() from cgroup_task_release() to cgroup_task_free() to
fix this race. cgroup_task_free() runs from __put_task_struct(), which is
always after both paths, making the cleanup safe.
Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Wed, 29 Oct 2025 06:19:15 +0000 (20:19 -1000)]
cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
The current names cgroup_exit(), cgroup_release(), and cgroup_free() are
confusing because they look like they're operating on cgroups themselves when
they're actually task lifecycle hooks. For example, cgroup_init() initializes
the cgroup subsystem while cgroup_exit() is a task exit notification to
cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and
cgroup_task_free() to make it clear that these operate on tasks.
Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Wed, 29 Oct 2025 18:35:25 +0000 (08:35 -1000)]
sched_ext/tools: Restore backward compat with v6.12 kernels
Commit 111a79800aed ("tools/sched_ext: Strip compatibility macros for
cgroup and dispatch APIs") removed the compat layer for v6.12-v6.13 kfunc
renaming, but v6.12 is the current LTS kernel and will remain supported
through 2026. Restore backward compatibility so schedulers built with v6.19+
headers can run on v6.12 kernels.
The restored compat differs from the original in two ways:
1. Uses ___new/___old suffixes instead of ___compat for clarity. The new
macros check for v6.13+ names (scx_bpf_dsq_move*), fall back to v6.12
names (scx_bpf_dispatch_from_dsq*, scx_bpf_consume), then return safe
no-ops for missing symbols.
2. Integrates with the args-struct-packing changes added in c0d630ba347c
("sched_ext: Wrap kfunc args in struct to prepare for aux__prog").
scx_bpf_dsq_insert_vtime() now tries __scx_bpf_dsq_insert_vtime (args
struct), then scx_bpf_dsq_insert_vtime___compat (v6.13-v6.18), then
scx_bpf_dispatch_vtime___compat (pre-v6.13).
Forward compatibility is not restored - binaries built against v6.13 or
earlier headers won't run on v6.19+ kernels, as the old kfunc names are not
exported. This is acceptable since the priority is new binaries running on
older kernels.
Also add missing compat checks for ops.cgroup_set_bandwidth() (added v6.17)
and ops.cgroup_set_idle() (added v6.19). These need to be NULLed out in
userspace on older kernels.
Reported-by: Andrea Righi <arighi@nvidia.com> Acked-and-tested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Mon, 27 Oct 2025 18:19:40 +0000 (08:19 -1000)]
sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
The ops.cpu_acquire/release() callbacks miss events under multiple conditions.
There are two distinct task dispatch gaps that can cause cpu_released flag
desynchronization:
1. balance-to-pick_task gap: This is what was originally reported. balance_scx()
can enqueue a task, but during consume_remote_task() when the rq lock is
released, a higher priority task can be enqueued and ultimately picked while
cpu_released remains false. This gap is closeable via RETRY_TASK handling.
2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local
DSQ. By the time the sched path runs on the target CPU, higher class tasks may
already be queued. In such cases, nothing on sched_ext side will be invoked,
and the only solution would be a hook invoked regardless of sched class, which
isn't desirable.
Rather than adding invasive core hooks, BPF schedulers can use generic BPF
mechanisms like tracepoints. From SCX scheduler's perspective, this is congruent
with other mechanisms it already uses and doesn't add further friction.
The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
a CPU gets preempted by a higher priority scheduling class. However, the old
scx_bpf_reenqueue_local() could only be called from cpu_release() context.
Add a new version of scx_bpf_reenqueue_local() that can be called from any
context by deferring the actual re-enqueue operation. This eliminates the need
for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.
Update scx_qmap to demonstrate the new approach using sched_switch instead of
cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
v6.23.
Tejun Heo [Sat, 25 Oct 2025 00:18:48 +0000 (14:18 -1000)]
sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local()
Factor out the core re-enqueue logic from scx_bpf_reenqueue_local() into a
new reenq_local() helper function. scx_bpf_reenqueue_local() now handles the
BPF kfunc checks and calls reenq_local() to perform the actual work.
This is a prep patch to allow reenq_local() to be called from other contexts.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Sat, 25 Oct 2025 00:18:47 +0000 (14:18 -1000)]
sched_ext: Split schedule_deferred() into locked and unlocked variants
schedule_deferred() currently requires the rq lock to be held so that it can
use scheduler hooks for efficiency when available. However, there are cases
where deferred actions need to be scheduled from contexts that don't hold the
rq lock.
Split into schedule_deferred() which can be called from any context and just
queues irq_work, and schedule_deferred_locked() which requires the rq lock and
can optimize by using scheduler hooks when available. Update the existing call
site to use the _locked variant.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Wed, 29 Oct 2025 15:18:13 +0000 (05:18 -1000)]
Merge branch 'for-6.18-fixes' into for-6.19
Pull to receive f4fa7c25f632 ("sched_ext: Fix use of uninitialized variable
in scx_bpf_cpuperf_set()") which conflicts with changes for planned
sub-sched changes.
Andrea Righi [Wed, 29 Oct 2025 13:08:43 +0000 (14:08 +0100)]
sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()
scx_bpf_cpuperf_set() has a typo where it dereferences the local
variable @sch, instead of the global @scx_root pointer. Fix by
dereferencing the correct variable.
Fixes: 956f2b11a8a4f ("sched_ext: Drop kf_cpu_valid()") Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 28 Oct 2025 21:38:34 +0000 (11:38 -1000)]
sched_ext: Use SCX_TASK_READY test instead of tryget_task_struct() during class switch
ddf7233fcab6 ("sched/ext: Fix invalid task state transitions on class
switch") added tryget_task_struct() test during scx_enable()'s class
switching loop. The reason for the addition was to avoid enabling tasks which
skipped prep in the previous loop due to being dead.
While tryget_task_struct() does work for this purpose as tasks that fail
tryget always will fail it, it's a bit roundabout. A more direct way is
testing whether the task is in READY state. Switch to testing SCX_TASK_READY
directly.
Cc: Andrea Righi <arighi@nvidia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Fri, 24 Oct 2025 23:33:50 +0000 (13:33 -1000)]
sched_ext: Add ___compat suffix to scx_bpf_dsq_insert___v2 in compat.bpf.h
2dbbdeda77a6 ("sched_ext: Fix scx_bpf_dsq_insert() backward binary
compatibility") renamed the new bool-returning variant to scx_bpf_dsq_insert___v2
in the kernel. However, libbpf currently only strips ___SUFFIX on the BPF side,
not on kernel symbols, so the compat wrapper couldn't match the kernel kfunc and
would always fall back to the old variant even when the new one was available.
Add an extra ___compat suffix as a workaround - libbpf strips one suffix on the
BPF side leaving ___v2, which then matches the kernel kfunc directly. In the
future when libbpf strips all suffixes on both sides, all suffixes can be
dropped.
Andrea Righi [Fri, 24 Oct 2025 22:01:02 +0000 (00:01 +0200)]
sched_ext: Fix scx_bpf_dsq_peek() with FIFO DSQs
When removing a task from a FIFO DSQ, we must delete it from the list
before updating dsq->first_task, otherwise the following lookup will
just re-read the same task, leaving first_task pointing to removed
entry.
This issue only affects DSQs operating in FIFO mode, as priority DSQs
correctly update the rbtree before re-evaluating the new first task.
Remove the item from the list before refreshing the first task to
guarantee the correct behavior in FIFO DSQs.
Fixes: 44f5c8ec5b9ad ("sched_ext: Add lockless peek operation for DSQs") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
The find_user_dsq() function is called from contexts that are already
under RCU read lock protection. Switch from rhashtable_lookup_fast() to
rhashtable_lookup() to avoid redundant RCU locking.
Requires: bee8a520eb84 ("rhashtable: Use rcu_dereference_all and rcu_dereference_all_check") Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Wed, 22 Oct 2025 20:56:29 +0000 (10:56 -1000)]
sched_ext: Rename pnt_seq to kick_sync
The pnt_seq field and related infrastructure were originally named for
"pick next task sequence", reflecting their original implementation in
scx_next_task_picked(). However, the sequence counter is now incremented in
both put_prev_task_scx() and pick_task_scx() and its purpose is to
synchronize kick operations via SCX_KICK_WAIT, not specifically to track
pick_next_task events.
Rename to better reflect the actual semantics:
- pnt_seq -> kick_sync
- scx_kick_pseqs -> scx_kick_syncs
- pseqs variables -> ksyncs
- Update comments to refer to "kick_sync sequence" instead of "pick_task
sequence"
This is a pure renaming with no functional changes.
Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Wed, 22 Oct 2025 20:56:28 +0000 (10:56 -1000)]
sched_ext: Fix SCX_KICK_WAIT to work reliably
SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.
This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
which was always called when a CPU picks the next task to run, allowing
SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
pick the next task.
However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked()
with switch_class()") replaced scx_next_task_picked() with the
switch_class() callback, which is only called when switching between sched
classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
reliably incremented unless the previous task was SCX and the next task was
not.
This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx() into
pick_task_scx()") which refactored the pick path making put_prev_task_scx()
the natural place to track task switches for SCX_KICK_WAIT. The fix moves
pnt_seq increment to put_prev_task_scx() and also increments it in
pick_task_scx() to handle cases where the same task is re-selected, whether
by BPF scheduler decision or slice refill. The semantics: If the current
task on the target CPU is SCX, SCX_KICK_WAIT waits until the CPU enters the
scheduling path. This provides sufficient guarantee for use cases like core
scheduling while keeping the operation self-contained within SCX.
v2: - Also increment pnt_seq in pick_task_scx() to handle same-task
re-selection (Andrea Righi).
- Use smp_cond_load_acquire() for the busy-wait loop for better
architecture optimization (Peter Zijlstra).
Tejun Heo [Wed, 22 Oct 2025 20:56:27 +0000 (10:56 -1000)]
sched_ext: Don't kick CPUs running higher classes
When a sched_ext scheduler tries to kick a CPU, the CPU may be running a
higher class task. sched_ext has no control over such CPUs. A sched_ext
scheduler couldn't have expected to get access to the CPU after kicking it
anyway. Skip kicking when the target CPU is running a higher class.
Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old
void-returning version to scx_bpf_dsq_insert___compat, with the expectation
that libbpf would match old binaries to the ___compat variant, maintaining
backward binary compatibility. However, while libbpf ignores ___suffix on
the BPF side when matching symbols, it doesn't do so for kernel-side symbols.
Old binaries compiled with the original scx_bpf_dsq_insert() could no longer
resolve the symbol.
Fix by reversing the naming: Keep scx_bpf_dsq_insert() as the old
void-returning interface and add ___v2 to the new bool-returning version.
This allows old binaries to continue working while new code can use the
___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the
___v2 suffix can be dropped when the compat interface is removed.
Waiman Long [Mon, 20 Oct 2025 02:32:06 +0000 (22:32 -0400)]
cgroup/cpuset: Don't track # of local child partitions
The cpuset structure has a nr_subparts field which tracks the number
of child local partitions underneath a particular cpuset. Right now,
nr_subparts is only used in partition_is_populated() to avoid iteration
of child cpusets if the condition is right. So by always performing the
child iteration, we can avoid tracking the number of child partitions
and simplify the code a bit.
Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Linus Torvalds [Sun, 19 Oct 2025 14:59:43 +0000 (04:59 -1000)]
Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Make sure the check for lost pelt idle time is done unconditionally
to have correct lost idle time accounting
- Stop the deadline server task before a CPU goes offline
* tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix pelt lost idle time detection
sched/deadline: Stop dl_server before CPU goes offline
Linus Torvalds [Sun, 19 Oct 2025 14:54:08 +0000 (04:54 -1000)]
Merge tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Make sure perf reporting works correctly in setups using
overlayfs or FUSE
- Move the uprobe optimization to a better location logically
* tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix MMAP2 event device with backing files
perf/core: Fix MMAP event path names with backing files
perf/core: Fix address filter match with backing files
uprobe: Move arch_uprobe_optimize right after handlers execution
Linus Torvalds [Sun, 19 Oct 2025 14:41:27 +0000 (04:41 -1000)]
Merge tag 'x86_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Reset the why-the-system-rebooted register on AMD to avoid stale bits
remaining from previous boots
- Add a missing barrier in the TLB flushing code to prevent erroneously
not flushing a TLB generation
- Make sure cpa_flush() does not overshoot when computing the end range
of a flush region
- Fix resctrl bandwidth counting on AMD systems when the amount of
monitoring groups created exceeds the number the hardware can track
* tag 'x86_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Prevent reset reasons from being retained across reboot
x86/mm: Fix SMP ordering in switch_mm_irqs_off()
x86/mm: Fix overflow in __cpa_addr()
x86/resctrl: Fix miscount of bandwidth event when reactivating previously unavailable RMID
Linus Torvalds [Sat, 18 Oct 2025 20:05:13 +0000 (10:05 -1000)]
Merge tag 'rust-rustfmt' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux
Pull rustfmt fixes from Miguel Ojeda:
"Rust 'rustfmt' cleanup
'rustfmt', by default, formats imports in a way that is prone to
conflicts while merging and rebasing, since in some cases it condenses
several items into the same line.
Document in our guidelines that we will handle this for the moment
with the trailing empty comment workaround and make the tree
'rustfmt'-clean again"
* tag 'rust-rustfmt' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: bitmap: fix formatting
rust: cpufreq: fix formatting
rust: alloc: employ a trailing comment to keep vertical layout
docs: rust: add section on imports formatting
Linus Torvalds [Sat, 18 Oct 2025 18:38:28 +0000 (08:38 -1000)]
Merge tag 'tpmdd-next-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull tpm fix from Jarkko Sakkinen:
"Correct the state transitions for ARM FF-A to match the spec and how
tpm_crb behaves on other platforms"
* tag 'tpmdd-next-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm_crb: Add idle support for the Arm FF-A start method
Linus Torvalds [Sat, 18 Oct 2025 18:35:09 +0000 (08:35 -1000)]
Merge tag 'pci-v6.18-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Search for MSI Capability with correct ID to fix an MSI regression on
platforms with Cadence IP (Hans Zhang)
- Revert early bridge resource set up to fix resource assignment
failures that broke at least alpha boot and Snapdragon ath12k WiFi
(Ilpo Järvinen)
- Implement VMD .irq_startup()/.irq_shutdown() to fix IRQ issues that
caused boot crashes and broken devices below VMD (Inochi Amaoto)
- Select CONFIG_SCREEN_INFO on X86 to fix black screen on boot when
SCREEN_INFO not selected (Mario Limonciello)
* tag 'pci-v6.18-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI/VGA: Select SCREEN_INFO on X86
PCI: vmd: Override irq_startup()/irq_shutdown() in vmd_init_dev_msi_info()
PCI: Revert early bridge resource set up
PCI: cadence: Search for MSI Capability with correct ID
Linus Torvalds [Sat, 18 Oct 2025 18:22:07 +0000 (08:22 -1000)]
Merge tag 'cxl-fixes-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull Compute Express Link fixes from Dave Jiang:
"A small collection of CXL fixes. In addition to some misc fixes for
the CXL subsystem, a number of fixes for CXL extended linear cache
support are included to make it functional again.
- Avoid missing port component registers setup due to dport
enumeration failure
- Add check for no entries in cxl_feature_info to address accessing
invalid pointer.
- Use %pa printk format to emit resource_size_t in
validate_region_offset()
CXL extended linear cache support fixes:
- Fix setup of memory resource in cxl_acpi_set_cache_size()
- Set range param for region_res_match_cxl_range() as const
(addresses a compile warning for match_region_by_range() fix)
- Fix match_region_by_range() to use region_res_match_cxl_range()
- Subtract to find an hpa_alias0 in cxl_poison events to correct the
alias math calculation"
* tag 'cxl-fixes-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/trace: Subtract to find an hpa_alias0 in cxl_poison events
cxl/region: Use %pa printk format to emit resource_size_t
cxl: Fix match_region_by_range() to use region_res_match_cxl_range()
cxl: Set range param for region_res_match_cxl_range() as const
cxl/acpi: Fix setup of memory resource in cxl_acpi_set_cache_size()
cxl/features: Add check for no entries in cxl_feature_info
cxl/port: Avoid missing port component registers setup
Linus Torvalds [Sat, 18 Oct 2025 18:00:43 +0000 (08:00 -1000)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Replace bpf_map_kmalloc_node() with kmalloc_nolock() to fix kmemleak
imbalance in tracking of bpf_async_cb structures (Alexei Starovoitov)
- Make selftests/bpf arg_parsing.c more robust to errors (Andrii
Nakryiko)
- Fix redefinition of 'off' as different kind of symbol when I40E
driver is builtin (Brahmajit Das)
- Do not disable preemption in bpf_test_run (Sahil Chandna)
- Fix memory leak in __lookup_instance error path (Shardul Bankar)
- Ensure test data is flushed to disk before reading it (Xing Guo)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Fix redefinition of 'off' as different kind of symbol
bpf: Do not disable preemption in bpf_test_run().
bpf: Fix memory leak in __lookup_instance error path
selftests: arg_parsing: Ensure data is flushed to disk before reading.
bpf: Replace bpf_map_kmalloc_node() with kmalloc_nolock() to allocate bpf_async_cb structures.
selftests/bpf: make arg_parsing.c more robust to crashes
bpf: test_run: Fix ctx leak in bpf_prog_test_run_xdp error path
Linus Torvalds [Sat, 18 Oct 2025 17:23:59 +0000 (07:23 -1000)]
Merge tag 'exfat-for-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat fixes from Namjae Jeon:
- Fix out-of-bounds in FS_IOC_SETFSLABEL
- Add validation for stream entry size to prevent infinite loop
* tag 'exfat-for-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: fix out-of-bounds in exfat_nls_to_ucs2()
exfat: fix improper check of dentry.stream.valid_size
Linus Torvalds [Sat, 18 Oct 2025 17:18:48 +0000 (07:18 -1000)]
Merge tag 'nfs-for-6.18-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
- Fix for FlexFiles mirror->dss allocation
- Apply delay_retrans to async operations
- Check if suid/sgid is cleared after a write when needed
- Fix setting the state renewal timer for early mounts after a reboot
* tag 'nfs-for-6.18-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS4: Fix state renewals missing after boot
NFS: check if suid/sgid was cleared after a write as needed
NFS4: Apply delay_retrans to async operations
NFSv4/flexfiles: fix to allocate mirror->dss before use
Linus Torvalds [Sat, 18 Oct 2025 17:11:32 +0000 (07:11 -1000)]
Merge tag '6.18-rc1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"smb client fixes, security and smbdirect improvements, and some minor cleanup:
- Important OOB DFS fix
- Fix various potential tcon refcount leaks
- smbdirect (RDMA) fixes (following up from test event a few weeks
ago):
- Fixes to improve and simplify handling of memory lifetime of
smbdirect_mr_io structures, when a connection gets disconnected
- Make sure we really wait to reach SMBDIRECT_SOCKET_DISCONNECTED
before destroying resources
- Make sure the send/recv submission/completion queues are large
enough to avoid ib_post_send() from failing under pressure
- convert cifs.ko to use the recommended crypto libraries (instead of
crypto_shash), this also can improve performance
- Three small cleanup patches"
* tag '6.18-rc1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: (24 commits)
smb: client: Consolidate cmac(aes) shash allocation
smb: client: Remove obsolete crypto_shash allocations
smb: client: Use HMAC-MD5 library for NTLMv2
smb: client: Use MD5 library for SMB1 signature calculation
smb: client: Use MD5 library for M-F symlink hashing
smb: client: Use HMAC-SHA256 library for SMB2 signature calculation
smb: client: Use HMAC-SHA256 library for key generation
smb: client: Use SHA-512 library for SMB3.1.1 preauth hash
cifs: parse_dfs_referrals: prevent oob on malformed input
smb: client: Fix refcount leak for cifs_sb_tlink
smb: client: let smbd_destroy() wait for SMBDIRECT_SOCKET_DISCONNECTED
smb: move some duplicate definitions to common/cifsglob.h
smb: client: let destroy_mr_list() keep smbdirect_mr_io memory if registered
smb: client: let destroy_mr_list() call ib_dereg_mr() before ib_dma_unmap_sg()
smb: client: call ib_dma_unmap_sg if mr->sgt.nents is not 0
smb: client: improve logic in smbd_deregister_mr()
smb: client: improve logic in smbd_register_mr()
smb: client: improve logic in allocate_mr_list()
smb: client: let destroy_mr_list() remove locked from the list
smb: client: let destroy_mr_list() call list_del(&mr->list)
...
Linus Torvalds [Sat, 18 Oct 2025 17:07:14 +0000 (07:07 -1000)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix the handling of ZCR_EL2 in NV VMs
- Pick the correct translation regime when doing a PTW on the back of
a SEA
- Prevent userspace from injecting an event into a vcpu that isn't
initialised yet
- Move timer save/restore to the sysreg handling code, fixing EL2
timer access in the process
- Add FGT-based trapping of MDSCR_EL1 to reduce the overhead of debug
- Fix trapping configuration when the host isn't GICv3
- Improve the detection of HCR_EL2.E2H being RES1
- Drop a spurious 'break' statement in the S1 PTW
- Don't try to access SPE when owned by EL3
Documentation updates:
- Document the failure modes of event injection
- Document that a GICv3 guest can be created on a GICv5 host with
FEAT_GCIE_LEGACY
Selftest improvements:
- Add a selftest for the effective value of HCR_EL2.AMO
- Address build warning in the timer selftest when building with
clang
- Teach irqfd selftests about non-x86 architectures
- Add missing sysregs to the set_id_regs selftest
- Fix vcpu allocation in the vgic_lpi_stress selftest
- Correctly enable interrupts in the vgic_lpi_stress selftest
x86:
- Expand the KVM_PRE_FAULT_MEMORY selftest to add a regression test
for the bug fixed by commit 3ccbf6f47098 ("KVM: x86/mmu: Return
-EAGAIN if userspace deletes/moves memslot during prefault")
- Don't try to get PMU capabilities from perf when running a CPU with
hybrid CPUs/PMUs, as perf will rightly WARN.
guest_memfd:
- Rework KVM_CAP_GUEST_MEMFD_MMAP (newly introduced in 6.18) into a
more generic KVM_CAP_GUEST_MEMFD_FLAGS
- Add a guest_memfd INIT_SHARED flag and require userspace to
explicitly set said flag to initialize memory as SHARED,
irrespective of MMAP.
The behavior merged in 6.18 is that enabling mmap() implicitly
initializes memory as SHARED, which would result in an ABI
collision for x86 CoCo VMs as their memory is currently always
initialized PRIVATE.
- Allow mmap() on guest_memfd for x86 CoCo VMs, i.e. on VMs with
private memory, to enable testing such setups, i.e. to hopefully
flush out any other lurking ABI issues before 6.18 is officially
released.
- Add testcases to the guest_memfd selftest to cover guest_memfd
without MMAP, and host userspace accesses to mmap()'d private
memory"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (46 commits)
arm64: Revamp HCR_EL2.E2H RES1 detection
KVM: arm64: nv: Use FGT write trap of MDSCR_EL1 when available
KVM: arm64: Compute per-vCPU FGTs at vcpu_load()
KVM: arm64: selftests: Fix misleading comment about virtual timer encoding
KVM: arm64: selftests: Add an E2H=0-specific configuration to get_reg_list
KVM: arm64: selftests: Make dependencies on VHE-specific registers explicit
KVM: arm64: Kill leftovers of ad-hoc timer userspace access
KVM: arm64: Fix WFxT handling of nested virt
KVM: arm64: Move CNT*CT_EL0 userspace accessors to generic infrastructure
KVM: arm64: Move CNT*_CVAL_EL0 userspace accessors to generic infrastructure
KVM: arm64: Move CNT*_CTL_EL0 userspace accessors to generic infrastructure
KVM: arm64: Add timer UAPI workaround to sysreg infrastructure
KVM: arm64: Make timer_set_offset() generally accessible
KVM: arm64: Replace timer context vcpu pointer with timer_id
KVM: arm64: Introduce timer_context_to_vcpu() helper
KVM: arm64: Hide CNTHV_*_EL2 from userspace for nVHE guests
Documentation: KVM: Update GICv3 docs for GICv5 hosts
KVM: arm64: gic-v3: Only set ICH_HCR traps for v2-on-v3 or v3 guests
KVM: arm64: selftests: Actually enable IRQs in vgic_lpi_stress
KVM: arm64: selftests: Allocate vcpus with correct size
...
Linus Torvalds [Sat, 18 Oct 2025 17:02:28 +0000 (07:02 -1000)]
Merge tag 'powerpc-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Madhavan Srinivasan:
- Fix to handle NULL pointer dereference at irq domain teardown
- Fix for handling extraction of struct xive_irq_data
- Fix to skip parameter area allocation when fadump disabled
Thanks to Ganesh Goudar, Hari Bathini, Nam Cao, Ritesh Harjani (IBM),
Sourabh Jain, and Venkat Rao Bagalkote,
* tag 'powerpc-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/fadump: skip parameter area allocation when fadump is disabled
powerpc, ocxl: Fix extraction of struct xive_irq_data
powerpc/pseries/msi: Fix NULL pointer dereference at irq domain teardown
Linus Torvalds [Sat, 18 Oct 2025 16:59:25 +0000 (06:59 -1000)]
Merge tag 'slab-for-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- Fixes for two bugs that can be triggered when debugging options are
enabled (Hao Ge, Vlastimil Babka)
* tag 'slab-for-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slab: reset slab->obj_ext when freeing and it is OBJEXTS_ALLOC_FAIL
slab: fix clearing freelist in free_deferred_objects()
Stuart Yoder [Sat, 18 Oct 2025 11:25:18 +0000 (14:25 +0300)]
tpm_crb: Add idle support for the Arm FF-A start method
According to the CRB over FF-A specification [1], a TPM that implements
the ABI must comply with the TCG PTP specification. This requires support
for the Idle and Ready states.
This patch implements CRB control area requests for goIdle and
cmdReady on FF-A based TPMs.
The FF-A message used to notify the TPM of CRB updates includes a
locality parameter, which provides a hint to the TPM about which
locality modified the CRB. This patch adds a locality parameter
to __crb_go_idle() and __crb_cmd_ready() to support this.
Paolo Bonzini [Sat, 18 Oct 2025 08:25:43 +0000 (10:25 +0200)]
Merge tag 'kvm-x86-fixes-6.18-rc2' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.18:
- Expand the KVM_PRE_FAULT_MEMORY selftest to add a regression test for the
bug fixed by commit 3ccbf6f47098 ("KVM: x86/mmu: Return -EAGAIN if userspace
deletes/moves memslot during prefault")
- Don't try to get PMU capabbilities from perf when running a CPU with hybrid
CPUs/PMUs, as perf will rightly WARN.
- Rework KVM_CAP_GUEST_MEMFD_MMAP (newly introduced in 6.18) into a more
generic KVM_CAP_GUEST_MEMFD_FLAGS
- Add a guest_memfd INIT_SHARED flag and require userspace to explicitly set
said flag to initialize memory as SHARED, irrespective of MMAP. The
behavior merged in 6.18 is that enabling mmap() implicitly initializes
memory as SHARED, which would result in an ABI collision for x86 CoCo VMs
as their memory is currently always initialized PRIVATE.
- Allow mmap() on guest_memfd for x86 CoCo VMs, i.e. on VMs with private
memory, to enable testing such setups, i.e. to hopefully flush out any
other lurking ABI issues before 6.18 is officially released.
- Add testcases to the guest_memfd selftest to cover guest_memfd without MMAP,
and host userspace accesses to mmap()'d private memory.
Linus Torvalds [Fri, 17 Oct 2025 23:04:21 +0000 (13:04 -1000)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Explicitly encode the XZR register if the value passed to
write_sysreg_s() is 0.
The GIC CDEOI instruction is encoded as a system register write with
XZR as the source register. However, clang does not honour the "Z"
register constraint, leading to incorrect code generation
- Ensure the interrupts (DAIF.IF) are unmasked when completing
single-step of a suspended breakpoint before calling
exit_to_user_mode().
With pseudo-NMIs, interrupts are (additionally) masked at the PMR_EL1
register, handled by local_irq_*()
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: debug: always unmask interrupts in el0_softstp()
arm64/sysreg: Fix GIC CDEOI instruction encoding
- Avoid duplicate device probes by avoiding DT hardware probing when
ACPI is enabled in early boot
- Use the correct set of dependencies for
CONFIG_ARCH_HAS_ELF_CORE_EFLAGS, avoiding an allnoconfig warning
- Fix a few other minor issues
* tag 'riscv-for-linux-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: kprobes: convert one final __ASSEMBLY__ to __ASSEMBLER__
riscv: Respect dependencies of ARCH_HAS_ELF_CORE_EFLAGS
riscv: acpi: avoid errors caused by probing DT devices when ACPI is used
riscv: kprobes: Fix probe address validation
riscv: entry: fix typo in comment 'instruciton' -> 'instruction'
RISC-V: clear hot-unplugged cores from all task mm_cpumasks to avoid rfence errors
riscv: kgdb: Ensure that BUFMAX > NUMREGBYTES
rust: cfi: only 64-bit arm and x86 support CFI_CLANG
Brahmajit Das [Fri, 17 Oct 2025 17:15:51 +0000 (22:45 +0530)]
selftests/bpf: Fix redefinition of 'off' as different kind of symbol
This fixes the following build error
CLNG-BPF [test_progs] verifier_global_ptr_args.bpf.o
progs/verifier_global_ptr_args.c:228:5: error: redefinition of 'off' as
different kind of symbol
228 | u32 off;
| ^
The symbol 'off' was previously defined in
tools/testing/selftests/bpf/tools/include/vmlinux.h, which includes an
enum i40e_ptp_gpio_pin_state from
drivers/net/ethernet/intel/i40e/i40e_ptp.c:
This enum is included when CONFIG_I40E is enabled. As of commit 032676ff8217 ("LoongArch: Update Loongson-3 default config file"),
CONFIG_I40E is set in the defconfig, which leads to the conflict.
Renaming the local variable avoids the redefinition and allows the
build to succeed.
Suggested-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Brahmajit Das <listout@listout.xyz> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251017171551.53142-1-listout@listout.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Sahil Chandna [Tue, 14 Oct 2025 18:56:35 +0000 (00:26 +0530)]
bpf: Do not disable preemption in bpf_test_run().
The timer mode is initialized to NO_PREEMPT mode by default,
this disables preemption and force execution in atomic context
causing issue on PREEMPT_RT configurations when invoking
spin_lock_bh(), leading to the following warning:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 6107, name: syz.0.17
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 1
Preemption disabled at:
[<ffffffff891fce58>] bpf_test_timer_enter+0xf8/0x140 net/bpf/test_run.c:42
Fix this, by removing NO_PREEMPT/NO_MIGRATE mode check.
Also, the test timer context no longer needs explicit calls to
migrate_disable()/migrate_enable() with rcu_read_lock()/rcu_read_unlock().
Use helpers rcu_read_lock_dont_migrate() and rcu_read_unlock_migrate()
instead.
Reported-by: syzbot+1f1fbecb9413cdbfbef8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1f1fbecb9413cdbfbef8 Suggested-by: Yonghong Song <yonghong.song@linux.dev> Suggested-by: Menglong Dong <menglong.dong@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Tested-by: syzbot+1f1fbecb9413cdbfbef8@syzkaller.appspotmail.com Co-developed-by: Brahmajit Das <listout@listout.xyz> Signed-off-by: Brahmajit Das <listout@listout.xyz> Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com> Link: https://lore.kernel.org/r/20251014185635.10300-1-chandna.sahil@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Ada Couprie Diaz [Tue, 14 Oct 2025 09:25:36 +0000 (10:25 +0100)]
arm64: debug: always unmask interrupts in el0_softstp()
We intend that EL0 exception handlers unmask all DAIF exceptions
before calling exit_to_user_mode().
When completing single-step of a suspended breakpoint, we do not call
local_daif_restore(DAIF_PROCCTX) before calling exit_to_user_mode(),
leaving all DAIF exceptions masked.
When pseudo-NMIs are not in use this is benign.
When pseudo-NMIs are in use, this is unsound. At this point interrupts
are masked by both DAIF.IF and PMR_EL1, and subsequent irq flag
manipulation may not work correctly. For example, a subsequent
local_irq_enable() within exit_to_user_mode_loop() will only unmask
interrupts via PMR_EL1 (leaving those masked via DAIF.IF), and
anything depending on interrupts being unmasked (e.g. delivery of
signals) will not work correctly.
This was detected by CONFIG_ARM64_DEBUG_PRIORITY_MASKING.
Move the call to `try_step_suspended_breakpoints()` outside of the check
so that interrupts can be unmasked even if we don't call the step handler.
Fixes: 0ac7584c08ce ("arm64: debug: split single stepping exception entry") Cc: <stable@vger.kernel.org> # 6.17 Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com>
[catalin.marinas@arm.com: added Mark's rewritten commit log and some whitespace] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
The GIC CDEOI system instruction requires the Rt field to be set to 0b11111
otherwise the instruction behaviour becomes CONSTRAINED UNPREDICTABLE.
Currenly, its usage is encoded as a system register write, with a constant
0 value:
write_sysreg_s(0, GICV5_OP_GIC_CDEOI)
While compiling with GCC, the 0 constant value, through these asm
constraints and modifiers ('x' modifier and 'Z' constraint combo):
asm volatile(__msr_s(r, "%x0") : : "rZ" (__val));
forces the compiler to issue the XZR register for the MSR operation (ie
that corresponds to Rt == 0b11111) issuing the right instruction encoding.
Unfortunately LLVM does not yet understand that modifier/constraint
combo so it ends up issuing a different register from XZR for the MSR
source, which in turns means that it encodes the GIC CDEOI instruction
wrongly and the instruction behaviour becomes CONSTRAINED UNPREDICTABLE
that we must prevent.
Add a conditional to write_sysreg_s() macro that detects whether it
is passed a constant 0 value and issues an MSR write with XZR as source
register - explicitly doing what the asm modifier/constraint is meant to
achieve through constraints/modifiers, fixing the LLVM compilation issue.
Fixes: 7ec80fb3f025 ("irqchip/gic-v5: Add GICv5 PPI support") Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Acked-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Cc: Sascha Bischoff <sascha.bischoff@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Linus Torvalds [Fri, 17 Oct 2025 15:45:54 +0000 (08:45 -0700)]
Merge tag 'io_uring-6.18-20251016' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
- Revert of a change that went into an older kernel, and which has been
reported to cause a regression for some write workloads on LVM while
a snapshop is being created
- Fix a regression from this merge window, where some compilers (and/or
certain .config options) would cause an earlier evaluations of a
dereference which would then cause a NULL pointer dereference.
I was only able to reproduce this with OPTIMIZE_FOR_SIZE=y, but David
Howells hit it with just KASAN enabled. Depending on how things
inlined, this makes sense
- Fix for a missing lock around a mem region unregistration
- Fix for ring resizing with the same placement after resize
* tag 'io_uring-6.18-20251016' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/rw: check for NULL io_br_sel when putting a buffer
io_uring: fix unexpected placement on same size resizing
io_uring: protect mem region deregistration
Revert "io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()"
Linus Torvalds [Fri, 17 Oct 2025 15:31:26 +0000 (08:31 -0700)]
Merge tag 'block-6.18-20251016' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- iostats accounting fixed on multipath retries (Amit)
- secure concatenation response fixup (Martin)
- tls partial record fixup (Wilfred)
- Fix for a lockdep reported issue with the elevator lock and
blk group frozen operations
- Fix for a regression in this merge window, where updating
'nr_requests' would not do the right thing for queues with
shared tags
* tag 'block-6.18-20251016' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
nvme/tcp: handle tls partially sent records in write_space()
block: Remove elevator_lock usage from blkg_conf frozen operations
blk-mq: fix stale tag depth for shared sched tags in blk_mq_update_nr_requests()
nvme-auth: update sc_c in host response
nvme-multipath: Skip nr_active increments in RETRY disposition
Linus Torvalds [Fri, 17 Oct 2025 15:16:58 +0000 (08:16 -0700)]
Merge tag 'drm-fixes-2025-10-17' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"As per usual xe/amdgpu are the leaders, with some i915 and then a
bunch of scattered fixes. There are a bunch of stability fixes for
some older amdgpu cards.
draw:
- Avoid color truncation
gpuvm:
- Avoid kernel-doc warning
sched:
- Avoid double free
i915:
- Skip GuC communication warning if reset is in progress
- Couple frontbuffer related fixes
- Deactivate PSR only on LNL and when selective fetch enabled
xe:
- Increase global invalidation timeout to handle some workloads
- Fix NPD while evicting BOs in an array of VM binds
- Fix resizable BAR to account for possibly needing to move BARs
other than the LMEMBAR
- Fix error handling in xe_migrate_init()
- Fix atomic fault handling with mixed mappings or if the page is
already in VRAM
- Enable media samplers power gating for platforms before Xe2
- Fix de-registering exec queue from GuC when unbinding
- Ensure data migration to system if indicated by madvise with SVM
- Fix kerneldoc for kunit change
- Always account for cacheline alignment on migration
- Drop bogus assertion on eviction
amdgpu:
- Backlight fix
- SI fixes
- CIK fix
- Make CE support debug only
- IP discovery fix
- Ring reset fixes
- GPUVM fault memory barrier fix
- Drop unused structures in amdgpu_drm.h
- JPEG debugfs fix
- VRAM handling fixes for GPUs without VRAM
- GC 12 MES fixes
amdkfd:
- MES fix
ast:
- Fix display output after reboot
bridge:
- lt9211: Fix version check
panthor:
- Fix MCU suspend
qaic:
- Init bootlog in correct order
- Treat remaining == 0 as error in find_and_map_user_pages()
- Lock access to DBC request queue
rockchip:
- vop2: Fix destination size in atomic check"
* tag 'drm-fixes-2025-10-17' of https://gitlab.freedesktop.org/drm/kernel: (44 commits)
drm/sched: Fix potential double free in drm_sched_job_add_resv_dependencies
drm/xe/evict: drop bogus assert
drm/xe/migrate: don't misalign current bytes
drm/xe/kunit: Fix kerneldoc for parameterized tests
drm/xe/svm: Ensure data will be migrated to system if indicated by madvise.
drm/gpuvm: Fix kernel-doc warning for drm_gpuvm_map_req.map
drm/i915/psr: Deactivate PSR only on LNL and when selective fetch enabled
drm/ast: Blank with VGACR17 sync enable, always clear VGACRB6 sync off
accel/qaic: Synchronize access to DBC request queue head & tail pointer
accel/qaic: Treat remaining == 0 as error in find_and_map_user_pages()
accel/qaic: Fix bootlog initialization ordering
drm/rockchip: vop2: use correct destination rectangle height check
drm/draw: fix color truncation in drm_draw_fill24
drm/xe/guc: Check GuC running state before deregistering exec queue
drm/xe: Enable media sampler power gating
drm/xe: Handle mixed mappings and existing VRAM on atomic faults
drm/xe/migrate: Fix an error path
drm/xe: Move rebar to be done earlier
drm/xe: Don't allow evicting of BOs in same VM in array of VM binds
drm/xe: Increase global invalidation timeout to 1000us
...
commit 337bf13aa9dda ("PCI/VGA: Replace vga_is_firmware_default() with a
screen info check") introduced an implicit dependency upon SCREEN_INFO by
removing the open coded implementation.
If a user didn't have CONFIG_SCREEN_INFO set, vga_is_firmware_default()
would now return false. SCREEN_INFO is only used on X86 so add a
conditional select for SCREEN_INFO to ensure that the VGA arbiter works as
intended.
Fixes: 337bf13aa9dda ("PCI/VGA: Replace vga_is_firmware_default() with a screen info check") Reported-by: Eric Biggers <ebiggers@kernel.org> Closes: https://lore.kernel.org/linux-pci/20251012182302.GA3412@sol/ Suggested-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Link: https://patch.msgid.link/20251013220829.1536292-1-superm1@kernel.org
Inochi Amaoto [Tue, 14 Oct 2025 01:46:07 +0000 (09:46 +0800)]
PCI: vmd: Override irq_startup()/irq_shutdown() in vmd_init_dev_msi_info()
Since commit 54f45a30c0d0 ("PCI/MSI: Add startup/shutdown for per
device domains") set callback irq_startup() and irq_shutdown() of
the struct pci_msi[x]_template, __irq_startup() will always invokes
irq_startup() callback instead of irq_enable() callback overridden
in vmd_init_dev_msi_info(). This will not start the IRQ correctly.
Also override irq_startup()/irq_shutdown() in vmd_init_dev_msi_info(),
so the irq_startup() can invoke the real logic.
Dave Airlie [Thu, 16 Oct 2025 23:39:34 +0000 (09:39 +1000)]
Merge tag 'drm-xe-fixes-2025-10-16' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Increase global invalidation timeout to handle some workloads
(Kenneth Graunke)
- Fix NPD while evicting BOs in an array of VM binds (Matthew Brost)
- Fix resizable BAR to account for possibly needing to move BARs other
than the LMEMBAR (Lucas De Marchi)
- Fix error handling in xe_migrate_init() (Thomas Hellström)
- Fix atomic fault handling with mixed mappings or if the page is
already in VRAM (Matthew Brost)
- Enable media samplers power gating for platforms before Xe2 (Vinay
Belgaumkar)
- Fix de-registering exec queue from GuC when unbinding (Matthew Brost)
- Ensure data migration to system if indicated by madvise with SVM
(Thomas Hellström)
- Fix kerneldoc for kunit change (Matt Roper)
- Always account for cacheline alignment on migration (Matthew Auld)
- Drop bogus assertion on eviction (Matthew Auld)
Miguel Ojeda [Fri, 10 Oct 2025 17:43:49 +0000 (19:43 +0200)]
docs: rust: add section on imports formatting
`rustfmt`, by default, formats imports in a way that is prone to conflicts
while merging and rebasing, since in some cases it condenses several
items into the same line.
For instance, Linus mentioned [1] that the following case:
use crate::{
fmt,
page::AsPageIter,
};
is compressed by `rustfmt` into:
use crate::{fmt, page::AsPageIter};
which is undesirable.
Similarly, `rustfmt` may put several items in the same line even if the
braces span already multiple lines, e.g.:
use kernel::{
acpi, c_str,
device::{property, Core},
of, platform,
};
The options that control the formatting behavior around imports are
generally unstable, and `rustfmt` releases do not allow to use nightly
features, unlike the compiler and other Rust tooling [2].
For the moment, we can introduce a workaround to prevent `rustfmt`
from compressing the example above -- the "trailing empty comment":
use crate::{
fmt,
page::AsPageIter, //
};
which is reminiscent of the trailing comma behavior in other formatters.
We already used empty comments for formatting purposes in the past,
e.g. in commit b9b701fce49a ("rust: clarify the language unstable features
in use").
In addition, `rustfmt` actually reformats with a vertical layout (i.e. it
does not put two items in the same line) when seeing such a comment,
i.e. it doesn't just preserve the formatting, which is good in the sense
that we can use it to easily reformat some imports, since it matches
the style we generally want to have.
A Git merge driver would help (suggested by Gary and Wedson), though
maintainers would need to set it up, the diffs would still be larger
and the formatting rules for imports would remain hard to predict.
Thus document the style that we will follow in the coding guidelines
by introducing a new section and explain how the trailing empty comment
works there too.
We discussed the issue with upstream Rust in our usual Rust <-> Rust
for Linux meeting [3], and there have also been a few other discussions
in parallel in issues [4][5] and Zulip [6]. We will see what happens,
but upstream Rust has already created a subteam of `rustfmt` to try
to overcome the bandwidth issue [7], which is a good signal, and some
organization work has already started (e.g. tracking issues). We will
continue our discussions with them about it.
Dave Airlie [Thu, 16 Oct 2025 20:57:44 +0000 (06:57 +1000)]
Merge tag 'amd-drm-fixes-6.18-2025-10-16' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.18-2025-10-16:
amdgpu:
- Backlight fix
- SI fixes
- CIK fix
- Make CE support debug only
- IP discovery fix
- Ring reset fixes
- GPUVM fault memory barrier fix
- Drop unused structures in amdgpu_drm.h
- JPEG debugfs fix
- VRAM handling fixes for GPUs without VRAM
- GC 12 MES fixes
Dave Airlie [Thu, 16 Oct 2025 20:46:14 +0000 (06:46 +1000)]
Merge tag 'drm-intel-fixes-2025-10-16' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes
- Skip GuC communication warning if reset is in progress (Zhanjun)
- Couple frontbuffer related fixes (Ville)
- Deactivate PSR only on LNL and when selective fetch enabled (Jouni)
Jens Axboe [Thu, 16 Oct 2025 19:25:40 +0000 (13:25 -0600)]
Merge tag 'nvme-6.18-2025-10-16' of git://git.infradead.org/nvme into block-6.18
Pull NVMe fixes from Keith:
"- iostats accounting fixed on multipath retries (Amit)
- secure concatenation response fixup (Martin)
- tls partial record fixup (Wilfred)"
* tag 'nvme-6.18-2025-10-16' of git://git.infradead.org/nvme:
nvme/tcp: handle tls partially sent records in write_space()
nvme-auth: update sc_c in host response
nvme-multipath: Skip nr_active increments in RETRY disposition
Tejun Heo [Thu, 16 Oct 2025 18:45:38 +0000 (08:45 -1000)]
sched_ext: Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.19
Pull in tip/sched/core to receive:
50653216e4ff ("sched: Add support to pick functions to take rf") 4c95380701f5 ("sched/ext: Fold balance_scx() into pick_task_scx()")
which will enable clean integration of DL server support among other things.
This conflicts with the following from sched_ext/for-6.18-fixes:
a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")
which adds maybe_queue_balance_callback() to balance_scx() which is removed
by 50653216e4ff. Resolve by moving the invocation to pick_task_scx() in the
equivalent location.
Emil Tsalapatis [Thu, 16 Oct 2025 18:11:26 +0000 (11:11 -0700)]
sched_ext: fix flag check for deferred callbacks
When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING
instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests
whether there is already a pending request for queue_balance_callback() to
be invoked at the end of .balance().
Fixes: a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Wilfred Mallawa [Fri, 10 Oct 2025 07:19:42 +0000 (17:19 +1000)]
nvme/tcp: handle tls partially sent records in write_space()
With TLS enabled, records that are encrypted and appended to TLS TX
list can fail to see a retry if the underlying TCP socket is busy, for
example, hitting an EAGAIN from tcp_sendmsg_locked(). This is not known
to the NVMe TCP driver, as the TLS layer successfully generated a record.
Typically, the TLS write_space() callback would ensure such records are
retried, but in the NVMe TCP Host driver, write_space() invokes
nvme_tcp_write_space(). This causes a partially sent record in the TLS TX
list to timeout after not being retried.
This patch fixes the above by calling queue->write_space(), which calls
into the TLS layer to retry any pending records.
Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Takashi Iwai [Thu, 16 Oct 2025 18:14:24 +0000 (20:14 +0200)]
Merge tag 'asoc-fix-v6.18-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.18
A moderately large collection of driver specific fixes, plus a few new
quirks and device IDs. The NAU8821 changes are a little large but more
in mechanical ways than in ways that are complex.
Shardul Bankar [Thu, 16 Oct 2025 06:33:30 +0000 (12:03 +0530)]
bpf: Fix memory leak in __lookup_instance error path
When __lookup_instance() allocates a func_instance structure but fails
to allocate the must_write_set array, it returns an error without freeing
the previously allocated func_instance. This causes a memory leak of 192
bytes (sizeof(struct func_instance)) each time this error path is triggered.
Fix by freeing 'result' on must_write_set allocation failure.
Fixes: b3698c356ad9 ("bpf: callchain sensitive stack liveness tracking using CFG") Reported-by: BPF Runtime Fuzzer (BRF) Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com
Linus Torvalds [Thu, 16 Oct 2025 17:22:38 +0000 (10:22 -0700)]
Merge tag 'for-6.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- in tree-checker fix extref bounds check
- reorder send context structure to avoid
-Wflex-array-member-not-at-end warning
- fix extent readahead length for compressed extents
- fix memory leaks on error paths (qgroup assign ioctl, zone loading
with raid stripe tree enabled)
- fix how device specific mount options are applied, in particular the
'ssd' option will be set unexpectedly
- fix tracking of relocation state when tasks are running and
cancellation is attempted
- adjust assertion condition for folios allocated for scrub
- remove incorrect assertion checking for block group when populating
free space tree
* tag 'for-6.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: fix -Wflex-array-member-not-at-end warning in struct send_ctx
btrfs: tree-checker: fix bounds check in check_inode_extref()
btrfs: fix memory leaks when rejecting a non SINGLE data profile without an RST
btrfs: fix incorrect readahead expansion length
btrfs: do not assert we found block group item when creating free space tree
btrfs: do not use folio_test_partial_kmap() in ASSERT()s
btrfs: only set the device specific options after devices are opened
btrfs: fix memory leak on duplicated memory in the qgroup assign ioctl
btrfs: fix clearing of BTRFS_FS_RELOC_RUNNING if relocation already running
Linus Torvalds [Thu, 16 Oct 2025 17:16:41 +0000 (10:16 -0700)]
Merge tag 'v6.18-rc1-smb-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix RPC hang due to locking bug
- Fix for memory leak in read and refcount leak (in session setup)
- Minor cleanup
* tag 'v6.18-rc1-smb-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix recursive locking in RPC handle list access
smb/server: fix possible refcount leak in smb2_sess_setup()
smb/server: fix possible memory leak in smb2_read()
smb: server: Use common error handling code in smb_direct_rdma_xmit()
Linus Torvalds [Thu, 16 Oct 2025 16:41:21 +0000 (09:41 -0700)]
Merge tag 'net-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from CAN
Current release - regressions:
- udp: do not use skb_release_head_state() before
skb_attempt_defer_free()
- gro_cells: use nested-BH locking for gro_cell
- dpll: zl3073x: increase maximum size of flash utility
Previous releases - regressions:
- core: fix lockdep splat on device unregister
- tcp: fix tcp_tso_should_defer() vs large RTT
- tls:
- don't rely on tx_work during send()
- wait for pending async decryptions if tls_strp_msg_hold fails
- can: j1939: add missing calls in NETDEV_UNREGISTER notification
handler
- eth: lan78xx: fix lost EEPROM write timeout in
lan78xx_write_raw_eeprom
Previous releases - always broken:
- ip6_tunnel: prevent perpetual tunnel growth
- dpll: zl3073x: handle missing or corrupted flash configuration
- can: m_can: fix pm_runtime and CAN state handling
- eth:
- ixgbe: fix too early devlink_free() in ixgbe_remove()
- ixgbevf: fix mailbox API compatibility
- gve: Check valid ts bit on RX descriptor before hw timestamping
- idpf: cleanup remaining SKBs in PTP flows
- r8169: fix packet truncation after S4 resume on RTL8168H/RTL8111H"
* tag 'net-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
udp: do not use skb_release_head_state() before skb_attempt_defer_free()
net: usb: lan78xx: fix use of improperly initialized dev->chipid in lan78xx_reset
netdevsim: set the carrier when the device goes up
selftests: tls: add test for short splice due to full skmsg
selftests: net: tls: add tests for cmsg vs MSG_MORE
tls: don't rely on tx_work during send()
tls: wait for pending async decryptions if tls_strp_msg_hold fails
tls: always set record_type in tls_process_cmsg
tls: wait for async encrypt in case of error during latter iterations of sendmsg
tls: trim encrypted message to match the plaintext on short splice
tg3: prevent use of uninitialized remote_adv and local_adv variables
MAINTAINERS: new entry for IPv6 IOAM
gve: Check valid ts bit on RX descriptor before hw timestamping
net: core: fix lockdep splat on device unregister
MAINTAINERS: add myself as maintainer for b53
selftests: net: check jq command is supported
net: airoha: Take into account out-of-order tx completions in airoha_dev_xmit()
tcp: fix tcp_tso_should_defer() vs large RTT
r8152: add error handling in rtl8152_driver_init
usbnet: Fix using smp_processor_id() in preemptible code warnings
...
Linus Torvalds [Thu, 16 Oct 2025 16:39:29 +0000 (09:39 -0700)]
Merge tag 'ata-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Niklas Cassel:
- Do not print an error message (and assume that the General Purpose
Log Directory log page is not supported) for a device that reports a
bogus General Purpose Logging Version.
Unsurprisingly, many vendors fail to report the only valid General
Purpose Logging Version (Damien)
* tag 'ata-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-core: relax checks in ata_read_log_directory()
Xing Guo [Thu, 16 Oct 2025 03:53:30 +0000 (11:53 +0800)]
selftests: arg_parsing: Ensure data is flushed to disk before reading.
test_parse_test_list_file writes some data to
/tmp/bpf_arg_parsing_test.XXXXXX and parse_test_list_file() will read
the data back. However, after writing data to that file, we forget to
call fsync() and it's causing testing failure in my laptop. This patch
helps fix it by adding the missing fsync() call.
The Logitech G502 Hero Wireless's high resolution scrolling resets after
being unplugged without notifying the driver, causing extremely slow
scrolling.
The only indication of this is a battery update packet, so add a quirk to
detect when the device is unplugged and re-enable the scrolling.
Eric Dumazet [Wed, 15 Oct 2025 05:27:15 +0000 (05:27 +0000)]
udp: do not use skb_release_head_state() before skb_attempt_defer_free()
Michal reported and bisected an issue after recent adoption
of skb_attempt_defer_free() in UDP.
The issue here is that skb_release_head_state() is called twice per skb,
one time from skb_consume_udp(), then a second time from skb_defer_free_flush()
and napi_consume_skb().
As Sabrina suggested, remove skb_release_head_state() call from
skb_consume_udp().
Add a DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb)) in skb_attempt_defer_free()
Many thanks to Michal, Sabrina, Paolo and Florian for their help.
Fixes: 6471658dc66c ("udp: use skb_attempt_defer_free()") Reported-and-bisected-by: Michal Kubecek <mkubecek@suse.cz> Closes: https://lore.kernel.org/netdev/gpjh4lrotyephiqpuldtxxizrsg6job7cvhiqrw72saz2ubs3h@g6fgbvexgl3r/ Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Michal Kubecek <mkubecek@suse.cz> Cc: Sabrina Dubroca <sd@queasysnail.net> Cc: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20251015052715.4140493-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Hao Ge [Wed, 15 Oct 2025 14:16:42 +0000 (22:16 +0800)]
slab: reset slab->obj_ext when freeing and it is OBJEXTS_ALLOC_FAIL
If obj_exts allocation failed, slab->obj_exts is set to OBJEXTS_ALLOC_FAIL,
But we do not clear it when freeing the slab. Since OBJEXTS_ALLOC_FAIL and
MEMCG_DATA_OBJEXTS currently share the same bit position, during the
release of the associated folio, a VM_BUG_ON_FOLIO() check in
folio_memcg_kmem() is triggered because the OBJEXTS_ALLOC_FAIL flag was
not cleared, causing it to be interpreted as a kmem folio (non-slab)
with MEMCG_OBJEXTS_DATA flag set, which is invalid because
MEMCG_OBJEXTS_DATA is supposed to be set only on slabs.
Another problem that predates sharing the OBJEXTS_ALLOC_FAIL and
MEMCG_DATA_OBJEXTS bits is that on configurations with
is_check_pages_enabled(), the non-cleared bit in page->memcg_data will
trigger a free_page_is_bad() failure "page still charged to cgroup"
When freeing a slab, we clear slab->obj_exts if the obj_ext array has
been successfully allocated. So let's clear it also when the allocation
has failed.
Fixes: 09c46563ff6d ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations") Fixes: 7612833192d5 ("slab: Reuse first bit for OBJEXTS_ALLOC_FAIL") Link: https://lore.kernel.org/all/20251015141642.700170-1-hao.ge@linux.dev/ Cc: <stable@vger.kernel.org> Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tvrtko Ursulin [Wed, 15 Oct 2025 08:40:15 +0000 (09:40 +0100)]
drm/sched: Fix potential double free in drm_sched_job_add_resv_dependencies
When adding dependencies with drm_sched_job_add_dependency(), that
function consumes the fence reference both on success and failure, so in
the latter case the dma_fence_put() on the error path (xarray failed to
expand) is a double free.
Interestingly this bug appears to have been present ever since
commit ebd5f74255b9 ("drm/sched: Add dependency tracking"), since the code
back then looked like this:
drm_sched_job_add_implicit_dependencies():
...
for (i = 0; i < fence_count; i++) {
ret = drm_sched_job_add_dependency(job, fences[i]);
if (ret)
break;
}
for (; i < fence_count; i++)
dma_fence_put(fences[i]);
Which means for the failing 'i' the dma_fence_put was already a double
free. Possibly there were no users at that time, or the test cases were
insufficient to hit it.
The bug was then only noticed and fixed after
commit 9c2ba265352a ("drm/scheduler: use new iterator in drm_sched_job_add_implicit_dependencies v2")
landed, with its fixup of
commit 4eaf02d6076c ("drm/scheduler: fix drm_sched_job_add_implicit_dependencies").
At that point it was a slightly different flavour of a double free, which
commit 963d0b356935 ("drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder")
noticed and attempted to fix.
But it only moved the double free from happening inside the
drm_sched_job_add_dependency(), when releasing the reference not yet
obtained, to the caller, when releasing the reference already released by
the former in the failure case.
As such it is not easy to identify the right target for the fixes tag so
lets keep it simple and just continue the chain.
While fixing we also improve the comment and explain the reason for taking
the reference and not dropping it.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Fixes: 963d0b356935 ("drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/dri-devel/aNFbXq8OeYl3QSdm@stanley.mountain/ Cc: Christian König <christian.koenig@amd.com> Cc: Rob Clark <robdclark@chromium.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Philipp Stanner <phasta@kernel.org> Cc: Christian König <ckoenig.leichtzumerken@gmail.com> Cc: dri-devel@lists.freedesktop.org Cc: stable@vger.kernel.org # v5.16+ Signed-off-by: Philipp Stanner <phasta@kernel.org> Link: https://lore.kernel.org/r/20251015084015.6273-1-tvrtko.ursulin@igalia.com