David Wei [Mon, 24 Feb 2025 04:13:19 +0000 (20:13 -0800)]
io_uring/zcrx: add selftest case for recvzc with read limit
Add a selftest case to iou-zcrx where the sender sends 4x4K = 16K and
the receiver does 4x4K recvzc requests. Validate that the requests
complete successfully and that the data is not corrupted.
David Wei [Mon, 24 Feb 2025 04:13:18 +0000 (20:13 -0800)]
io_uring/zcrx: add a read limit to recvzc requests
Currently multishot recvzc requests have no read limit and will remain
active so as long as the socket remains open. But, there are sometimes a
need to do a fixed length read e.g. peeking at some data in the socket.
Add a length limit to recvzc requests `len`. A value of 0 means no limit
which is the previous behaviour. A positive value N specifies how many
bytes to read from the socket.
Data will still be posted in aux completions, as before. This could be
split across multiple frags. But the primary recvzc request will now
complete once N bytes have been read. The completion of the recvzc
request will have res and cflags both set to 0.
Pavel Begunkov [Sun, 23 Feb 2025 17:22:31 +0000 (17:22 +0000)]
io_uring: make io_poll_issue() sturdier
io_poll_issue() forwards the call to io_issue_sqe() and thus inherits
some of the handling. That's not particularly failure resistant, as for
example returning an innocently looking IOU_OK from a multishot issue
will lead to severe bugs.
Reimplement io_poll_issue() without io_issue_sqe()'s request completion
logic. Remove extra checks as we know that req->file is already set,
linked timeout are armed, and iopoll is not supported. Also cover it
with warnings for now.
The patch should be useful by itself, but it's also preparing the
codebase for other future clean ups.
Pavel Begunkov [Sun, 23 Feb 2025 17:22:29 +0000 (17:22 +0000)]
io_uring/net: fix accept multishot handling
REQ_F_APOLL_MULTISHOT doesn't guarantee it's executed from the multishot
context, so a multishot accept may get executed inline, fail
io_req_post_cqe(), and ask the core code to kill the request with
-ECANCELED by returning IOU_STOP_MULTISHOT even when a socket has been
accepted and installed.
Compat performance is not important and simplicity is more appreciated.
Let's not be smart about it and use simpler copy_from_user() instead of
access + __get_user pair.
Pavel Begunkov [Mon, 24 Feb 2025 12:42:21 +0000 (12:42 +0000)]
io_uring/rw: compile out compat param passing
Even when COMPAT is compiled out, we still have to pass
ctx->compat to __import_iovec(). Replace the read with an indirection
with a constant when the kernel doesn't support compat.
Jens Axboe [Thu, 20 Feb 2025 15:00:27 +0000 (08:00 -0700)]
Merge branch 'for-6.15/block' into for-next
* for-6.15/block:
Revert "driver: block: release the lo_work_lock before queue_work"
block: refactor rq_qos_wait()
block: introduce init_wait_func()
loop: release the lo_work_lock before queue_work
blk-crypto: add ioctls to create and prepare hardware-wrapped keys
blk-crypto: show supported key types in sysfs
blk-crypto: add basic hardware-wrapped key support
Jens Axboe [Fri, 31 Jan 2025 21:29:09 +0000 (14:29 -0700)]
io_uring/epoll: add support for IORING_OP_EPOLL_WAIT
For existing epoll event loops that can't fully convert to io_uring,
the used approach is usually to add the io_uring fd to the epoll
instance and use epoll_wait() to wait on both "legacy" and io_uring
events. While this work, it isn't optimal as:
1) epoll_wait() is pretty limited in what it can do. It does not support
partial reaping of events, or waiting on a batch of events.
2) When an io_uring ring is added to an epoll instance, it activates the
io_uring "I'm being polled" logic which slows things down.
Rather than use this approach, with EPOLL_WAIT support added to io_uring,
event loops can use the normal io_uring wait logic for everything, as
long as an epoll wait request has been armed with io_uring.
Note that IORING_OP_EPOLL_WAIT does NOT take a timeout value, as this
is an async request. Waiting on io_uring events in general has various
timeout parameters, and those are the ones that should be used when
waiting on any kind of request. If events are immediately available for
reaping, then This opcode will return those immediately. If none are
available, then it will post an async completion when they become
available.
cqe->res will contain either an error code (< 0 value) for a malformed
request, invalid epoll instance, etc. It will return a positive result
indicating how many events were reaped.
IORING_OP_EPOLL_WAIT requests may be canceled using the normal io_uring
cancelation infrastructure.
Jens Axboe [Wed, 19 Feb 2025 17:22:26 +0000 (10:22 -0700)]
eventpoll: add epoll_sendevents() helper
Basic helper that copies ready events to the specified userspace
address. The event checking is quick and racy, it's up to the caller
to ensure it retries appropriately in case 0 events are copied.
Linus Torvalds [Thu, 20 Feb 2025 02:11:28 +0000 (18:11 -0800)]
Merge tag 'mm-hotfixes-stable-2025-02-19-17-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"18 hotfixes. 5 are cc:stable and the remainder address post-6.13
issues or aren't considered necessary for -stable kernels.
10 are for MM and 8 are for non-MM. All are singletons, please see the
changelogs for details"
* tag 'mm-hotfixes-stable-2025-02-19-17-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
test_xarray: fix failure in check_pause when CONFIG_XARRAY_MULTI is not defined
kasan: don't call find_vm_area() in a PREEMPT_RT kernel
MAINTAINERS: update Nick's contact info
selftests/mm: fix check for running THP tests
mm: hugetlb: avoid fallback for specific node allocation of 1G pages
memcg: avoid dead loop when setting memory.max
mailmap: update Nick's entry
mm: pgtable: fix incorrect reclaim of non-empty PTE pages
taskstats: modify taskstats version
getdelays: fix error format characters
mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize()
tools/mm: fix build warnings with musl-libc
mailmap: add entry for Feng Tang
.mailmap: add entries for Jeff Johnson
mm,madvise,hugetlb: check for 0-length range after end address adjustment
mm/zswap: fix inconsistency when zswap_store_page() fails
lib/iov_iter: fix import_iovec_ubuf iovec management
procfs: fix a locking bug in a vmcore_add_device_dump() error path
Pavel Begunkov [Wed, 19 Feb 2025 10:09:54 +0000 (10:09 +0000)]
io_uring/zcrx: fix leaks on failed registration
If we try to register a device-less interface like veth,
io_register_zcrx_ifq() will leak struct io_zcrx_ifq with a bunch of
resources attached to it. Fix that.
Fixes: 035af94b39fd ("io_uring/zcrx: grab a net device") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202502190532.W7NnmyiP-lkp@intel.com/ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fbf16279dd73fa4c6df048168728355636ba5f53.1739959771.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 19 Feb 2025 01:33:38 +0000 (01:33 +0000)]
io_uring/rw: don't directly use ki_complete
We want to avoid checking ->ki_complete directly in the io_uring
completion path. Fortunately we have only two callback the selection
of which depend on the ring constant flags, i.e. IOPOLL, so use that
to infer the function.
Pavel Begunkov [Wed, 19 Feb 2025 01:33:37 +0000 (01:33 +0000)]
io_uring/rw: forbid multishot async reads
At the moment we can't sanely handle queuing an async request from a
multishot context, so disable them. It shouldn't matter as pollable
files / socekts don't normally do async.
Patching it in __io_read() is not the cleanest way, but it's simpler
than other options, so let's fix it there and clean up on top.
Pavel Begunkov [Wed, 19 Feb 2025 10:08:01 +0000 (10:08 +0000)]
io_uring/zcrx: recheck ifq on shutdown
io_ring_exit_work() checks ifq before shutting it down and guarantees
that the pointer is stable, but instead of relying on rather complicated
synchronisation recheck the ifq pointer inside.
IO_NODE_ALLOC_CACHE_MAX has been unused since commit fbbb8e991d86
("io_uring/rsrc: get rid of io_rsrc_node allocation cache") removed the
rsrc_node_cache.
IO_RSRC_TAG_TABLE_SHIFT and IO_RSRC_TAG_TABLE_MASK have been unused
since commit 7029acd8a950 ("io_uring/rsrc: get rid of per-ring
io_rsrc_node list") removed the separate tag table for registered nodes.
Jens Axboe [Tue, 18 Feb 2025 23:47:40 +0000 (16:47 -0700)]
io_uring: fix spelling error in uapi io_uring.h
This is obviously not that important, but when changes are synced back
from the kernel to liburing, the codespell CI ends up erroring because
of this misspelling. Let's just correct it and avoid this biting us
again on an import.
Linus Torvalds [Tue, 18 Feb 2025 17:00:31 +0000 (09:00 -0800)]
Merge tag 'sound-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A slightly large collection of fixes, spread over various drivers.
Almost all are small and device-specific fixes and quirks in ASoC SOF
Intel and AMD, Renesas, Cirrus, HD-audio, in addition to a small fix
for MIDI 2.0"
* tag 'sound-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (41 commits)
ALSA: seq: Drop UMP events when no UMP-conversion is set
ALSA: hda/conexant: Add quirk for HP ProBook 450 G4 mute LED
ALSA: hda/cirrus: Reduce codec resume time
ALSA: hda/cirrus: Correct the full scale volume set logic
virtio_snd.h: clarify that `controls` depends on VIRTIO_SND_F_CTLS
ALSA: hda: Add error check for snd_ctl_rename_id() in snd_hda_create_dig_out_ctls()
ALSA: hda/tas2781: Fix index issue in tas2781 hda SPI driver
ASoC: imx-audmix: remove cpu_mclk which is from cpu dai device
ALSA: hda/realtek: Fixup ALC225 depop procedure
ALSA: hda/tas2781: Update tas2781 hda SPI driver
ASoC: cs35l41: Fix acpi_device_hid() not found
ASoC: SOF: amd: Add branch prediction hint in ACP IRQ handler
ASoC: SOF: amd: Handle IPC replies before FW_BOOT_COMPLETE
ASoC: SOF: amd: Drop unused includes from Vangogh driver
ASoC: SOF: amd: Add post_fw_run_delay ACP quirk
ASoC: Intel: soc-acpi-intel-ptl-match: revise typo of rt713_vb_l2_rt1320_l13
ASoC: Intel: soc-acpi-intel-ptl-match: revise typo of rt712_vb + rt1320 support
ALSA: Switch to use hrtimer_setup()
ALSA: hda: hda-intel: add Panther Lake-H support
ASoC: SOF: Intel: pci-ptl: Add support for PTL-H
...
Kemeng Shi [Thu, 13 Feb 2025 16:36:59 +0000 (00:36 +0800)]
test_xarray: fix failure in check_pause when CONFIG_XARRAY_MULTI is not defined
In case CONFIG_XARRAY_MULTI is not defined, xa_store_order can store a
multi-index entry but xas_for_each can't tell sbiling entry from valid
entry. So the check_pause failed when we store a multi-index entry and
wish xas_for_each can handle it normally. Avoid to store multi-index
entry when CONFIG_XARRAY_MULTI is disabled to fix the failure.
Link: https://lkml.kernel.org/r/20250213163659.414309-1-shikemeng@huaweicloud.com Fixes: c9ba5249ef8b ("Xarray: move forward index correctly in xas_pause()") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Closes: https://lore.kernel.org/r/CAMuHMdU_bfadUO=0OZ=AoQ9EAmQPA4wsLCBqohXR+QCeCKRn4A@mail.gmail.com Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit e30a0361b851 ("kasan: make report_lock a raw spinlock"),
report_lock was changed to raw_spinlock_t to fix another similar
PREEMPT_RT problem. That alone isn't enough to cover other corner cases.
print_address_description() is always invoked under the report_lock. The
context under this lock is always atomic even on PREEMPT_RT.
find_vm_area() acquires vmap_node::busy.lock which is a spinlock_t,
becoming a sleeping lock on PREEMPT_RT and must not be acquired in atomic
context.
Don't invoke find_vm_area() on PREEMPT_RT and just print the address.
Non-PREEMPT_RT builds remain unchanged. Add a DEFINE_WAIT_OVERRIDE_MAP()
macro to tell lockdep that this lock nesting is allowed because the
PREEMPT_RT part (which is invalid) has been taken care of. This macro was
first introduced in commit 0cce06ba859a ("debugobjects,locking: Annotate
debug_object_fill_pool() wait type violation").
Link: https://lkml.kernel.org/r/20250217204402.60533-1-longman@redhat.com Fixes: e30a0361b851 ("kasan: make report_lock a raw spinlock") Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mariano Pache <npache@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Brown [Wed, 12 Feb 2025 17:44:25 +0000 (17:44 +0000)]
selftests/mm: fix check for running THP tests
When testing if we should try to compact memory or drop caches before we
run the THP or HugeTLB tests we use | as an or operator. This doesn't
work since run_vmtests.sh is written in shell where this is used to pipe
the output of the first argument into the second. Instead use the shell's
-o operator.
Link: https://lkml.kernel.org/r/20250212-kselftest-mm-no-hugepages-v1-1-44702f538522@kernel.org Fixes: b433ffa8dbac ("selftests: mm: perform some system cleanup before using hugepages") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Mariano Pache <npache@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Luiz Capitulino [Tue, 11 Feb 2025 03:48:56 +0000 (22:48 -0500)]
mm: hugetlb: avoid fallback for specific node allocation of 1G pages
When using the HugeTLB kernel command-line to allocate 1G pages from a
specific node, such as:
default_hugepagesz=1G hugepages=1:1
If node 1 happens to not have enough memory for the requested number of 1G
pages, the allocation falls back to other nodes. A quick way to reproduce
this is by creating a KVM guest with a memory-less node and trying to
allocate 1 1G page from it. Instead of failing, the allocation will
fallback to other nodes.
This defeats the purpose of node specific allocation. Also, specific node
allocation for 2M pages don't have this behavior: the allocation will just
fail for the pages it can't satisfy.
This issue happens because HugeTLB calls memblock_alloc_try_nid_raw() for
1G boot-time allocation as this function falls back to other nodes if the
allocation can't be satisfied. Use memblock_alloc_exact_nid_raw()
instead, which ensures that the allocation will only be satisfied from the
specified node.
Link: https://lkml.kernel.org/r/20250211034856.629371-1-luizcap@redhat.com Fixes: b5389086ad7b ("hugetlbfs: extend the definition of hugepages parameter to support node allocation") Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The stress test involves CPU hotplug operations and memory control group
(memcg) operations. The scenario can be described as follows:
echo xx > memory.max cache_ap_online oom_reaper
(CPU23) (CPU50)
xx < usage stop_machine_from_inactive_cpu
for(;;) // all active cpus
trigger OOM queue_stop_cpus_work
// waiting oom_reaper
multi_cpu_stop(migration/xx)
// sync all active cpus ack
// waiting cpu23 ack
// CPU50 loops in multi_cpu_stop
waiting cpu50
Detailed explanation:
1. When the usage is larger than xx, an OOM may be triggered. If the
process does not handle with ths kill signal immediately, it will loop
in the memory_max_write.
2. When cache_ap_online is triggered, the multi_cpu_stop is queued to the
active cpus. Within the multi_cpu_stop function, it attempts to
synchronize the CPU states. However, the CPU23 didn't acknowledge
because it is stuck in a loop within the for(;;).
3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
for CPU23 to acknowledge the synchronization request.
4. Finally, it formed cyclic dependency and lead to softlockup and dead
loop.
To fix this issue, add cond_resched() in the memory_max_write, so that it
will not block migration task.
Link: https://lkml.kernel.org/r/20250211081819.33307-1-chenridong@huaweicloud.com Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage") Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Wang Weiyang <wangweiyang2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 11 Feb 2025 07:26:25 +0000 (15:26 +0800)]
mm: pgtable: fix incorrect reclaim of non-empty PTE pages
In zap_pte_range(), if the pte lock was released midway, the pte entries
may be refilled with physical pages by another thread, which may cause a
non-empty PTE page to be reclaimed and eventually cause the system to
crash.
To fix it, fall back to the slow path in this case to recheck if all pte
entries are still none.
Link: https://lkml.kernel.org/r/20250211072625.89188-1-zhengqi.arch@bytedance.com Fixes: 6375e95f381e ("mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reported-by: Christian Brauner <brauner@kernel.org> Closes: https://lore.kernel.org/all/20250207-anbot-bankfilialen-acce9d79a2c7@brauner/ Reported-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Closes: https://lore.kernel.org/all/152296f3-5c81-4a94-97f3-004108fba7be@gmx.com/ Tested-by: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wang Yaxin [Sat, 8 Feb 2025 06:49:01 +0000 (14:49 +0800)]
taskstats: modify taskstats version
After adding "delay max" and "delay min" to the taskstats structure, the
taskstats version needs to be updated.
Link: https://lkml.kernel.org/r/20250208144901218Q5ptVpqsQkb2MOEmW4Ujn@zte.com.cn Fixes: f65c64f311ee ("delayacct: add delay min to record delay peak") Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn> Reviewed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wang Yaxin [Sat, 8 Feb 2025 06:44:00 +0000 (14:44 +0800)]
getdelays: fix error format characters
getdelays had a compilation issue because the format string was not
updated when the "delay min" was added. For example, after adding the
"delay min" in printf, there were 7 strings but only 6 "%s" format
specifiers. Similarly, after adding the 't->cpu_delay_total', there were
7 variables but only 6 format characters specifiers, causing compilation
issues as follows. This commit fixes these issues to ensure that
getdelays compiles correctly.
root@xx:~/linux-next/tools/accounting$ make
getdelays.c:199:9: warning: format `%llu' expects argument of type
`long long unsigned int', but argument 8 has type `char *' [-Wformat=]
199 | printf("\n\nCPU %15s%15s%15s%15s%15s%15s\n"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.....
216 | "delay total", "delay average", "delay max", "delay min",
| ~~~~~~~~~~~
| |
| char *
getdelays.c:200:21: note: format string is defined here
200 | " %15llu%15llu%15llu%15llu%15.3fms%13.6fms\n"
| ~~~~~^
| |
| long long unsigned int
| %15s
getdelays.c:199:9: warning: format `%f' expects argument of type
`double', but argument 12 has type `long long unsigned int' [-Wformat=]
199 | printf("\n\nCPU %15s%15s%15s%15s%15s%15s\n"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.....
220 | (unsigned long long)t->cpu_delay_total,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| long long unsigned int
.....
Link: https://lkml.kernel.org/r/20250208144400544RduNRhwIpT3m2JyRBqskZ@zte.com.cn Fixes: f65c64f311ee ("delayacct: add delay min to record delay peak") Reviewed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Peilin He <he.peilin@zte.com.cn> Cc: Qiang Tu <tu.qiang35@zte.com.cn> Cc: wangyong <wang.yong12@zte.com.cn> Cc: ye xingchen <ye.xingchen@zte.com.cn> Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize()
If migration succeeded, we called
folio_migrate_flags()->mem_cgroup_migrate() to migrate the memcg from the
old to the new folio. This will set memcg_data of the old folio to 0.
Similarly, if migration failed, memcg_data of the dst folio is left unset.
If we call folio_putback_lru() on such folios (memcg_data == 0), we will
add the folio to be freed to the LRU, making memcg code unhappy. Running
the hmm selftests:
Likely, nothing else goes wrong: putting the last folio reference will
remove the folio from the LRU again. So besides memcg complaining, adding
the folio to be freed to the LRU is just an unnecessary step.
The new flow resembles what we have in migrate_folio_move(): add the dst
to the lru, remove migration ptes, unlock and unref dst.
mm,madvise,hugetlb: check for 0-length range after end address adjustment
Add a sanity check to madvise_dontneed_free() to address a corner case in
madvise where a race condition causes the current vma being processed to
be backed by a different page size.
During a madvise(MADV_DONTNEED) call on a memory region registered with a
userfaultfd, there's a period of time where the process mm lock is
temporarily released in order to send a UFFD_EVENT_REMOVE and let
userspace handle the event. During this time, the vma covering the
current address range may change due to an explicit mmap done concurrently
by another thread.
If, after that change, the memory region, which was originally backed by
4KB pages, is now backed by hugepages, the end address is rounded down to
a hugepage boundary to avoid data loss (see "Fixes" below). This rounding
may cause the end address to be truncated to the same address as the
start.
Make this corner case follow the same semantics as in other similar cases
where the requested region has zero length (ie. return 0).
This will make madvise_walk_vmas() continue to the next vma in the range
(this time holding the process mm lock) which, due to the prev pointer
becoming stale because of the vma change, will be the same hugepage-backed
vma that was just checked before. The next time madvise_dontneed_free()
runs for this vma, if the start address isn't aligned to a hugepage
boundary, it'll return -EINVAL, which is also in line with the madvise
api.
From userspace perspective, madvise() will return EINVAL because the start
address isn't aligned according to the new vma alignment requirements
(hugepage), even though it was correctly page-aligned when the call was
issued.
Link: https://lkml.kernel.org/r/20250203075206.1452208-1-rcn@igalia.com Fixes: 8ebe0a5eaaeb ("mm,madvise,hugetlb: fix unexpected data loss with MADV_DONTNEED on hugetlbfs") Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Florent Revest <revest@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hyeonggon Yoo [Wed, 29 Jan 2025 10:08:44 +0000 (19:08 +0900)]
mm/zswap: fix inconsistency when zswap_store_page() fails
Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
skips charging any zswap entries when it failed to zswap the entire folio.
However, when some base pages are zswapped but it failed to zswap the
entire folio, the zswap operation is rolled back. When freeing zswap
entries for those pages, zswap_entry_free() uncharges the zswap entries
that were not previously charged, causing zswap charging to become
inconsistent.
This inconsistency triggers two warnings with following steps:
# On a machine with 64GiB of RAM and 36GiB of zswap
$ stress-ng --bigheap 2 # wait until the OOM-killer kills stress-ng
$ sudo reboot
The two warnings are:
in mm/memcontrol.c:163, function obj_cgroup_release():
WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
in mm/page_counter.c:60, function page_counter_cancel():
if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
new, nr_pages))
zswap_stored_pages also becomes inconsistent in the same way.
As suggested by Kanchana, increment zswap_stored_pages and charge zswap
entries within zswap_store_page() when it succeeds. This way,
zswap_entry_free() will decrement the counter and uncharge the entries
when it failed to zswap the entire folio.
While this could potentially be optimized by batching objcg charging and
incrementing the counter, let's focus on fixing the bug this time and
leave the optimization for later after some evaluation.
After resolving the inconsistency, the warnings disappear.
[42.hyeyoo@gmail.com: refactor zswap_store_page()] Link: https://lkml.kernel.org/r/20250131082037.2426-1-42.hyeyoo@gmail.com Link: https://lkml.kernel.org/r/20250129100844.2935-1-42.hyeyoo@gmail.com Fixes: b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()") Co-developed-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
import_iovec() says that it should always be fine to kfree the iovec
returned in @iovp regardless of the error code. __import_iovec_ubuf()
never reallocates it and thus should clear the pointer even in cases when
copy_iovec_*() fail.
Link: https://lkml.kernel.org/r/378ae26923ffc20fd5e41b4360d673bf47b1775b.1738332461.git.asml.silence@gmail.com Fixes: 3b2deb0e46da ("iov_iter: import single vector iovecs as ITER_UBUF") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bart Van Assche [Wed, 29 Jan 2025 22:20:03 +0000 (14:20 -0800)]
procfs: fix a locking bug in a vmcore_add_device_dump() error path
Unlock vmcore_mutex when returning -EBUSY.
Link: https://lkml.kernel.org/r/20250129222003.1495713-1-bvanassche@acm.org Fixes: 0f3b1c40c652 ("fs/proc/vmcore: disallow vmcore modifications while the vmcore is open") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baoquan he <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Mon, 17 Feb 2025 18:38:25 +0000 (10:38 -0800)]
Merge tag 'vfs-6.14-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"It was reported that the acct(2) system call can be used to trigger a
NULL deref in cases where it is set to write to a file that triggers
an internal lookup.
This can e.g., happen when pointing acct(2) to /sys/power/resume. At
the point the where the write to this file happens the calling task
has already exited and called exit_fs() but an internal lookup might
be triggered through lookup_bdev(). This may trigger a NULL-deref when
accessing current->fs.
Reorganize the code so that the the final write happens from the
workqueue but with the caller's credentials. This preserves the
(strange) permission model and has almost no regression risk.
Also block access to kernel internal filesystems as well as procfs and
sysfs in the first place.
Various fixes for netfslib:
- Fix a number of read-retry hangs, including:
- Incorrect getting/putting of references on subreqs as we retry
them
- Failure to track whether a last old subrequest in a retried set
is superfluous
- Inconsistency in the usage of wait queues used for subrequests
(ie. using clear_and_wake_up_bit() whilst waiting on a private
waitqueue)
- Add stats counters for retries and publish in /proc/fs/netfs/stats.
This is not a fix per se, but is useful in debugging and shouldn't
otherwise change the operation of the code
- Fix the ordering of queuing subrequests with respect to setting the
request flag that says we've now queued them all"
* tag 'vfs-6.14-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix setting NETFS_RREQ_ALL_QUEUED to be after all subreqs queued
netfs: Add retry stat counters
netfs: Fix a number of read-retry hangs
acct: block access to kernel internal filesystems
acct: perform last write from workqueue
Linus Torvalds [Mon, 17 Feb 2025 18:32:19 +0000 (10:32 -0800)]
Merge tag 'powerpc-6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Madhavan Srinivasan:
- Couple of patches to fix KASAN failduring boot
- Fix to avoid warnings/errors when building with 4k page size
Thanks to Christophe Leroy, Ritesh Harjani (IBM), and Erhard Furtner
* tag 'powerpc-6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/code-patching: Fix KASAN hit by not flagging text patching area as VM_ALLOC
powerpc/64s: Rewrite __real_pte() and __rpte_to_hidx() as static inline
powerpc/code-patching: Disable KASAN report during patching via temporary mm
Takashi Iwai [Mon, 17 Feb 2025 17:00:30 +0000 (18:00 +0100)]
ALSA: seq: Drop UMP events when no UMP-conversion is set
When a destination client is a user client in the legacy MIDI mode and
it sets the no-UMP-conversion flag, currently the all UMP events are
still passed as-is. But this may confuse the user-space, because the
event packet size is different from the legacy mode.
Since we cannot handle UMP events in user clients unless it's running
in the UMP client mode, we should filter out those events instead of
accepting blindly. This patch addresses it by slightly adjusting the
conditions for UMP event handling at the event delivery time.
Ming Lei [Mon, 17 Feb 2025 03:16:26 +0000 (11:16 +0800)]
block: fix NULL pointer dereferenced within __blk_rq_map_sg
The block layer internal flush request may not have bio attached, so the
request iterator has to be initialized from valid req->bio, otherwise NULL
pointer dereferenced is triggered.
Cc: Christoph Hellwig <hch@lst.de> Reported-and-tested-by: Cheyenne Wills <cheyenne.wills@gmail.com> Fixes: b7175e24d6ac ("block: add a dma mapping iterator") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250217031626.461977-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
David Wei [Sat, 15 Feb 2025 00:09:46 +0000 (16:09 -0800)]
io_uring/zcrx: add selftest
Add a selftest for io_uring zero copy Rx. This test cannot run locally
and requires a remote host to be configured in net.config. The remote
host must have hardware support for zero copy Rx as listed in the
documentation page. The test will restore the NIC config back to before
the test and is idempotent.
liburing is required to compile the test and be installed on the remote
host running the test.
Pavel Begunkov [Sat, 15 Feb 2025 00:09:44 +0000 (16:09 -0800)]
io_uring/zcrx: add copy fallback
There are scenarios in which the zerocopy path can get a kernel buffer
instead of a net_iov and needs to copy it to the user, whether it is
because of mis-steering or simply getting an skb with the linear part.
In this case, grab a net_iov, copy into it and return it to the user as
normally.
At the moment the user doesn't get any indication whether there was a
copy or not, which is left for follow up work.
Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-10-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sat, 15 Feb 2025 00:09:43 +0000 (16:09 -0800)]
io_uring/zcrx: throttle receive requests
io_zc_rx_tcp_recvmsg() continues until it fails or there is nothing to
receive. If the other side sends fast enough, we might get stuck in
io_zc_rx_tcp_recvmsg() producing more and more CQEs but not letting the
user to handle them leading to unbound latencies.
Break out of it based on an arbitrarily chosen limit, the upper layer
will either return to userspace or requeue the request.
Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-9-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
David Wei [Sat, 15 Feb 2025 00:09:42 +0000 (16:09 -0800)]
io_uring/zcrx: set pp memory provider for an rx queue
Set the page pool memory provider for the rx queue configured for zero
copy to io_uring. Then the rx queue is reset using
netdev_rx_queue_restart() and netdev core + page pool will take care of
filling the rx queue from the io_uring zero copy memory provider.
For now, there is only one ifq so its destruction happens implicitly
during io_uring cleanup.
David Wei [Sat, 15 Feb 2025 00:09:41 +0000 (16:09 -0800)]
io_uring/zcrx: add io_recvzc request
Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a
socket. Only the connection should be land on the specific rx queue set
up for zero copy, and the socket must be handled by the io_uring
instance that the rx queue was registered for zero copy with. That's
because neither net_iovs / buffers from our queue can be read by outside
applications, nor zero copy is possible if traffic for the zero copy
connection goes to another queue. This coordination is outside of the
scope of this patch series. Also, any traffic directed to the zero copy
enabled queue is immediately visible to the application, which is why
CAP_NET_ADMIN is required at the registration step.
Of course, no data is actually read out of the socket, it has already
been copied by the netdev into userspace memory via DMA. OP_RECV_ZC
reads skbs out of the socket and checks that its frags are indeed
net_iovs that belong to io_uring. A cqe is queued for each one of these
frags.
Recall that each cqe is a big cqe, with the top half being an
io_uring_zcrx_cqe. The cqe res field contains the len or error. The
lower IORING_ZCRX_AREA_SHIFT bits of the struct io_uring_zcrx_cqe::off
field contain the offset relative to the start of the zero copy area.
The upper part of the off field is trivially zero, and will be used
to carry the area id.
For now, there is no limit as to how much work each OP_RECV_ZC request
does. It will attempt to drain a socket of all available data. This
request always operates in multishot mode.
Pavel Begunkov [Sat, 15 Feb 2025 00:09:40 +0000 (16:09 -0800)]
io_uring/zcrx: dma-map area for the device
Setup DMA mappings for the area into which we intend to receive data
later on. We know the device we want to attach to even before we get a
page pool and can pre-map in advance. All net_iov are synchronised for
device when allocated, see page_pool_mp_return_in_cache().
Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-6-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
Implement a page pool memory provider for io_uring to receieve in a
zero copy fashion. For that, the provider allocates user pages wrapped
around into struct net_iovs, that are stored in a previously registered
struct net_iov_area.
Unlike the traditional receive, that frees pages and returns them back
to the page pool right after data was copied to the user, e.g. inside
recv(2), we extend the lifetime until the user space confirms that it's
done processing the data. That's done by taking a net_iov reference.
When the user is done with the buffer, it must return it back to the
kernel by posting an entry into the refill ring, which is usually polled
off the io_uring memory provider callback in the page pool's netmem
allocation path.
There is also a separate set of per net_iov "user" references accounting
whether a buffer is currently given to the user (including possible
fragmentation).
Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-5-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sat, 15 Feb 2025 00:09:38 +0000 (16:09 -0800)]
io_uring/zcrx: grab a net device
Zerocopy receive needs a net device to bind to its rx queue and dma map
buffers. As a preparation to following patches, resolve a net device
from the if_idx parameter with no functional changes otherwise.
Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-4-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
David Wei [Sat, 15 Feb 2025 00:09:37 +0000 (16:09 -0800)]
io_uring/zcrx: add io_zcrx_area
Add io_zcrx_area that represents a region of userspace memory that is
used for zero copy. During ifq registration, userspace passes in the
uaddr and len of userspace memory, which is then pinned by the kernel.
Each net_iov is mapped to one of these pages.
The freelist is a spinlock protected list that keeps track of all the
net_iovs/pages that aren't used.
For now, there is only one area per ifq and area registration happens
implicitly as part of ifq registration. There is no API for
adding/removing areas yet. The struct for area registration is there for
future extensibility once we support multiple areas and TCP devmem.
Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Wei <dw@davidwei.uk> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20250215000947.789731-3-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
David Wei [Sat, 15 Feb 2025 00:09:36 +0000 (16:09 -0800)]
io_uring/zcrx: add interface queue and refill queue
Add a new object called an interface queue (ifq) that represents a net
rx queue that has been configured for zero copy. Each ifq is registered
using a new registration opcode IORING_REGISTER_ZCRX_IFQ.
The refill queue is allocated by the kernel and mapped by userspace
using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main
SQ/CQ. It is used by userspace to return buffers that it is done with,
which will then be re-used by the netdev again.
The main CQ ring is used to notify userspace of received data by using
the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each
entry contains the offset + len to the data.
For now, each io_uring instance only has a single ifq.
Merge networking zerocopy receive tree, to get the prep patches for
the io_uring rx zc support.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (63 commits)
net: add helpers for setting a memory provider on an rx queue
net: page_pool: add memory provider helpers
net: prepare for non devmem TCP memory providers
net: page_pool: add a mp hook to unregister_netdevice*
net: page_pool: add callback for mp info printing
netdev: add io_uring memory provider info
net: page_pool: create hooks for custom memory providers
net: generalise net_iov chunk owners
net: prefix devmem specific helpers
net: page_pool: don't cast mp param to devmem
tools: ynl: add all headers to makefile deps
eth: fbnic: set IFF_UNICAST_FLT to avoid enabling promiscuous mode when adding unicast addrs
eth: fbnic: add MAC address TCAM to debugfs
tools: ynl-gen: support limits using definitions
tools: ynl-gen: don't output external constants
net/mlx5e: Avoid WARN_ON when configuring MQPRIO with HTB offload enabled
net/mlx5e: Remove unused mlx5e_tc_flow_action struct
net/mlx5: Remove stray semicolon in LAG port selection table creation
net/mlx5e: Support FEC settings for 200G per lane link modes
net/mlx5: Add support for 200Gbps per lane link modes
...
8e5b3b89ecaf ("io_uring: remove struct io_tw_state::locked") removed the
only field of io_tw_state but kept it as a task work callback argument
to "forc[e] users not to invoke them carelessly out of a wrong context".
Passing the struct io_tw_state * argument adds a few instructions to all
callers that can't inline the functions and see the argument is unused.
So pass struct io_tw_state by value instead. Since it's a 0-sized value,
it can be passed without any instructions needed to initialize it.
In preparation for changing how io_tw_state is passed, introduce a type
alias io_tw_token_t for struct io_tw_state *. This allows for changing
the representation in one place, without having to update the many
functions that just forward their struct io_tw_state * argument.
Also add a comment to struct io_tw_state to explain its purpose.
io_uring: pass ctx instead of req to io_init_req_drain()
io_init_req_drain() takes a struct io_kiocb *req argument but only uses
it to get struct io_ring_ctx *ctx. The caller already knows the ctx, so
pass it instead.
Drop "req" from the function name since it operates on the ctx rather
than a specific req.
Jens Axboe [Sat, 8 Feb 2025 17:50:34 +0000 (10:50 -0700)]
io_uring/net: improve recv bundles
Current recv bundles are only supported for multishot receives, and
additionally they also always post at least 2 CQEs if more data is
available than what a buffer will hold. This happens because the initial
bundle recv will do a single buffer, and then do the rest of what is in
the socket as a followup receive. As shown in a test program, if 1k
buffers are available and 32k is available to receive in the socket,
you'd get the following completions:
bundle=1, mshot=0
cqe res 1024
cqe res 1024
[...]
cqe res 1024
bundle=1, mshot=1
cqe res 1024
cqe res 31744
where bundle=1 && mshot=0 will post 32 1k completions, and bundle=1 &&
mshot=1 will post a 1k completion and then a 31k completion.
To support bundle recv without multishot, it's possible to simply retry
the recv immediately and post a single completion, rather than split it
into two completions. With the below patch, the same test looks as
follows:
bundle=1, mshot=0
cqe res 32768
bundle=1, mshot=1
cqe res 32768
where mshot=0 works fine for bundles, and both of them post just a
single 32k completion rather than split it into separate completions.
Posting fewer completions is always a nice win, and not needing
multishot for proper bundle efficiency is nice for cases that can't
necessarily use multishot.
Jens Axboe [Wed, 5 Feb 2025 20:13:58 +0000 (13:13 -0700)]
io_uring/cancel: add generic cancel helper
Any opcode that is cancelable ends up defining its own cancel helper
for finding and canceling a specific request. Add a generic helper that
can be used for this purpose.
Jens Axboe [Wed, 5 Feb 2025 19:48:56 +0000 (12:48 -0700)]
io_uring/cancel: add generic remove_all helper
Any opcode that is cancelable ends up defining its own remove all
helper, which iterates the pending list and cancels matches. Add a
generic helper for it, which can be used by them.
Pavel Begunkov [Wed, 5 Feb 2025 11:36:49 +0000 (11:36 +0000)]
io_uring/kbuf: uninline __io_put_kbufs
__io_put_kbufs() and other helper functions are too large to be inlined,
compilers would normally refuse to do so. Uninline it and move together
with io_kbuf_commit into kbuf.c.
io_kbuf_commitSigned-off-by: Pavel Begunkov <asml.silence@gmail.com>
Pavel Begunkov [Wed, 5 Feb 2025 11:36:48 +0000 (11:36 +0000)]
io_uring/kbuf: introduce io_kbuf_drop_legacy()
io_kbuf_drop() is only used for legacy provided buffers, and so
__io_put_kbuf_list() is never called for REQ_F_BUFFER_RING. Remove the
dead branch out of __io_put_kbuf_list(), rename it into
io_kbuf_drop_legacy() and use it directly instead of io_kbuf_drop().
Pavel Begunkov [Wed, 5 Feb 2025 11:36:46 +0000 (11:36 +0000)]
io_uring/kbuf: remove legacy kbuf caching
Remove all struct io_buffer caches. It makes it a fair bit simpler.
Apart from from killing a bunch of lines and juggling between lists,
__io_put_kbuf_list() doesn't need ->completion_lock locking now.
Pavel Begunkov [Wed, 5 Feb 2025 11:36:45 +0000 (11:36 +0000)]
io_uring/kbuf: simplify __io_put_kbuf
As a preparation step remove an optimisation from __io_put_kbuf() trying
to use the locked cache. With that __io_put_kbuf_list() is only used
with ->io_buffers_comp, and we remove the explicit list argument.
Pavel Begunkov [Wed, 5 Feb 2025 11:36:42 +0000 (11:36 +0000)]
io_uring/kbuf: remove legacy kbuf bulk allocation
Legacy provided buffers are slow and discouraged in favour of the ring
variant. Remove the bulk allocation to keep it simpler as we don't care
about performance.