As has_unmovable_pages() call folio_hstate() without hugetlb_lock, there
is a race to free the HugeTLB page between PageHuge() and folio_hstate().
There is no need to add hugetlb_lock here as the HugeTLB page can be freed
in lot of places. So it's enough to unfold folio_hstate() and add a check
to avoid NULL pointer dereference for hugepage_migration_supported().
Link: https://lkml.kernel.org/r/20250122061151.578768-1-liushixin2@huawei.com Fixes: 464c7ffbcb16 ("mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yu Zhao [Wed, 8 Jan 2025 07:48:21 +0000 (00:48 -0700)]
mm/hugetlb_vmemmap: fix memory loads ordering
Using x86_64 as an example, for a 32KB struct page[] area describing a 2MB
hugeTLB, HVO reduces the area to 4KB by the following steps:
1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
by PTE 0, and at the same time change the permission from r/w to
r/o;
3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
to 4KB.
However, the following race can happen due to improperly memory loads
ordering:
CPU 1 (HVO) CPU 2 (speculative PFN walker)
atomic_add_unless(&page->_refcount)
XXX: try to modify r/o struct page[]
Specifically, page_is_fake_head() must be ordered after page_ref_count()
on CPU 2 so that it can only return true for this case, to avoid the later
attempt to modify r/o struct page[].
This patch adds the missing memory barrier and makes the tests on
page_is_fake_head() and page_ref_count() done in the proper order.
Link: https://lkml.kernel.org/r/20250108074822.722696-1-yuzhao@google.com Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Will Deacon <will@kernel.org> Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/ Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Will Deacon <will@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kemeng Shi [Sat, 22 Feb 2025 16:08:46 +0000 (00:08 +0800)]
mm: swap: use correct step in loop to wait all clusters in wait_for_allocation()
Use correct step in loop to wait all clusters in wait_for_allocation().
If we miss some cluster in wait_for_allocation(), use after free may occur
as follows:
shmem_writepage swapoff
folio_alloc_swap
get_swap_pages
scan_swap_map_slots
cluster_alloc_swap_entry
alloc_swap_scan_cluster
cluster_alloc_range
/* SWP_WRITEOK is valid */
if (!(si->flags & SWP_WRITEOK))
...
del_from_avail_list(p, true);
...
/* miss the cluster in shmem_writepage */
wait_for_allocation()
...
try_to_unuse()
...
add_to_swap_cache
/* dereference swap_address_space(entry) which is NULL */
xas_lock_irq(&xas);
Link: https://lkml.kernel.org/r/20250222160850.505274-3-shikemeng@huaweicloud.com Fixes: 9a0ddeb79880 ("mm, swap: hold a reference during scan and cleanup flag usage") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kemeng Shi [Sat, 22 Feb 2025 16:08:45 +0000 (00:08 +0800)]
mm: swap: avoid losing cluster in swap_reclaim_full_clusters()
If no swap cache is reclaimed, cluster taken off from full_clusters list
will not be put in any list and may not be reused. Do relocate_cluster
for such cluster to fix the issue.
Link: https://lkml.kernel.org/r/20250222160850.505274-2-shikemeng@huaweicloud.com Fixes: 3b644773eefda ("mm, swap: reduce contention on device lock") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Sat, 22 Feb 2025 16:19:52 +0000 (16:19 +0000)]
mm: abort vma_modify() on merge out of memory failure
The remainder of vma_modify() relies upon the vmg state remaining pristine
after a merge attempt.
Usually this is the case, however in the one edge case scenario of a merge
attempt failing not due to the specified range being unmergeable, but
rather due to an out of memory error arising when attempting to commit the
merge, this assumption becomes untrue.
This results in vmg->start, end being modified, and thus the proceeding
attempts to split the VMA will be done with invalid start/end values.
Thankfully, it is likely practically impossible for us to hit this in
reality, as it would require a maple tree node pre-allocation failure that
would likely never happen due to it being 'too small to fail', i.e. the
kernel would simply keep retrying reclaim until it succeeded.
However, this scenario remains theoretically possible, and what we are
doing here is wrong so we must correct it.
The safest option is, when this scenario occurs, to simply give up the
operation. If we cannot allocate memory to merge, then we cannot allocate
memory to split either (perhaps moreso!).
Any scenario where this would be happening would be under very extreme
(likely fatal) memory pressure, so it's best we give up early.
So there is no doubt it is appropriate to simply bail out in this
scenario.
However, in general we must if at all possible never assume VMG state is
stable after a merge attempt, since merge operations update VMG fields.
As a result, additionally also make this clear by storing start, end in
local variables.
The issue was reported originally by syzkaller, and by Brad Spengler (via
an off-list discussion), and in both instances it manifested as a
triggering of the assert:
VM_WARN_ON_VMG(start >= end, vmg);
In vma_merge_existing_range().
It seems at least one scenario in which this is occurring is one in which
the merge being attempted is due to an madvise() across multiple VMAs
which looks like this:
start end
|<------>|
|----------|------|
| vma | next |
|----------|------|
When madvise_walk_vmas() is invoked, we first find vma in the above
(determining prev to be equal to vma as we are offset into vma), and then
enter the loop.
We determine the end of vma that forms part of the range we are
madvise()'ing by setting 'tmp' to this value:
Where the visit() function pointer in this instance is
madvise_vma_behavior().
As observed in syzkaller reports, it is ultimately madvise_update_vma()
that is invoked, calling vma_modify_flags_name() and vma_modify() in turn.
Then, in vma_modify(), we attempt the merge:
merged = vma_merge_existing_range(vmg);
if (merged)
return merged;
We invoke this with vmg->start, end set to start, tmp as such:
start tmp
|<--->|
|----------|------|
| vma | next |
|----------|------|
We find ourselves in the merge right scenario, but the one in which we
cannot remove the middle (we are offset into vma).
Here we have a special case where vmg->start, end get set to perhaps
unintuitive values - we intended to shrink the middle VMA and expand the
next.
This means vmg->start, end are set to... vma->vm_start, start.
Now the commit_merge() fails, and vmg->start, end are left like this.
This means we return to the rest of vma_modify() with vmg->start, end
(here denoted as start', end') set as:
start' end'
|<-->|
|----------|------|
| vma | next |
|----------|------|
So we now erroneously try to split accordingly. This is where the
unfortunate stuff begins.
We start with:
/* Split any preceding portion of the VMA. */
if (vma->vm_start < vmg->start) {
...
}
This doesn't trigger as we are no longer offset into vma at the start.
But then we invoke:
/* Split any trailing portion of the VMA. */
if (vma->vm_end > vmg->end) {
...
}
Which does get invoked. This leaves us with:
start' end'
|<-->|
|----|-----|------|
| vma| new | next |
|----|-----|------|
We then return ultimately to madvise_walk_vmas(). Here 'new' is unknown,
and putting back the values known in this function we are faced with:
start tmp end
| | |
|----|-----|------|
| vma| new | next |
|----|-----|------|
prev
Then:
start = tmp;
So:
start end
| |
|----|-----|------|
| vma| new | next |
|----|-----|------|
prev
The following code does not cause anything to happen:
if (prev && start < prev->vm_end)
start = prev->vm_end;
if (start >= end)
break;
And then we invoke:
if (prev)
vma = find_vma(mm, prev->vm_end);
Which is where a problem occurs - we don't know about 'new' so we
essentially look for the vma after prev, which is new, whereas we actually
intended to discover next!
So we end up with:
start end
| |
|----|-----|------|
|prev| vma | next |
|----|-----|------|
And we have successfully bypassed all of the checks madvise_walk_vmas()
has to ensure early exit should we end up moving out of range.
Where start == tmp. That is, a zero range. This is not good.
We invoke visit() which is madvise_vma_behavior() which does not check the
range (for good reason, it assumes all checks have been done before it was
called), which in turn finally calls madvise_update_vma().
The madvise_update_vma() function calls vma_modify_flags_name() in turn,
which ultimately invokes vma_modify() with... start == end.
vma_modify() calls vma_merge_existing_range() and finally we hit:
VM_WARN_ON_VMG(start >= end, vmg);
Which triggers, as start == end.
While it might be useful to add some CONFIG_DEBUG_VM asserts in these
instances to catch this kind of error, since we have just eliminated any
possibility of that happening, we will add such asserts separately as to
reduce churn and aid backporting.
Ge Yang [Wed, 19 Feb 2025 03:46:44 +0000 (11:46 +0800)]
mm/hugetlb: wait for hugetlb folios to be freed
Since the introduction of commit c77c0a8ac4c52 ("mm/hugetlb: defer freeing
of huge pages if in non-task context"), which supports deferring the
freeing of hugetlb pages, the allocation of contiguous memory through
cma_alloc() may fail probabilistically.
In the CMA allocation process, if it is found that the CMA area is
occupied by in-use hugetlb folios, these in-use hugetlb folios need to be
migrated to another location. When there are no available hugetlb folios
in the free hugetlb pool during the migration of in-use hugetlb folios,
new folios are allocated from the buddy system. A temporary state is set
on the newly allocated folio. Upon completion of the hugetlb folio
migration, the temporary state is transferred from the new folios to the
old folios. Normally, when the old folios with the temporary state are
freed, it is directly released back to the buddy system. However, due to
the deferred freeing of hugetlb pages, the PageBuddy() check fails,
ultimately leading to the failure of cma_alloc().
Here is a simplified call trace illustrating the process:
cma_alloc()
->__alloc_contig_migrate_range() // Migrate in-use hugetlb folios
->unmap_and_move_huge_page()
->folio_putback_hugetlb() // Free old folios
->test_pages_isolated()
->__test_page_isolated_in_pageblock()
->PageBuddy(page) // Check if the page is in buddy
To resolve this issue, we have implemented a function named
wait_for_freed_hugetlb_folios(). This function ensures that the hugetlb
folios are properly released back to the buddy system after their
migration is completed. By invoking wait_for_freed_hugetlb_folios()
before calling PageBuddy(), we ensure that PageBuddy() will succeed.
Link: https://lkml.kernel.org/r/1739936804-18199-1-git-send-email-yangge1116@126.com Fixes: c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task context") Signed-off-by: Ge Yang <yangge1116@126.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gaoxu [Wed, 19 Feb 2025 01:56:28 +0000 (01:56 +0000)]
mm: fix possible NULL pointer dereference in __swap_duplicate
Add a NULL check on the return value of swp_swap_info in __swap_duplicate
to prevent crashes caused by NULL pointer dereference.
The reason why swp_swap_info() returns NULL is unclear; it may be due
to CPU cache issues or DDR bit flips. The probability of this issue is
very small - it has been observed to occur approximately 1 in 500,000
times per week. The stack info we encountered is as follows:
The patch seems to only provide a workaround, but there are no more
effective software solutions to handle the bit flips problem. This path
will change the issue from a system crash to a process exception, thereby
reducing the impact on the entire machine.
Signed-off-by: gao xu <gaoxu2@honor.com> Link: https://lkml.kernel.org/r/e223b0e6ba2f4924984b1917cc717bd5@honor.com Reviewed-by: Barry Song <baohua@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kmsan_handle_dma() is used by virtio_ring() which can be built as a
module. kmsan_handle_dma() needs to be exported otherwise building the
virtio_ring fails.
Export kmsan_handle_dma for modules.
Link: https://lkml.kernel.org/r/20250218091411.MMS3wBN9@linutronix.de Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502150634.qjxwSeJR-lkp@intel.com/ Fixes: 7ade4f10779c ("dma: kmsan: unpoison DMA mappings") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Macro Elver <elver@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gwan-gyeong Mun [Mon, 17 Feb 2025 11:41:33 +0000 (13:41 +0200)]
x86/vmemmap: use direct-mapped VA instead of vmemmap-based VA
Address an Oops issues when performing test of loading XE GPU driver
module after applying the GPU SVM and Xe SVM patch series[1] and the Dept
patch series[2].
The issue occurs when loading the xe driver via modprobe [3], which adds a
struct page for device memory via devm_memremap_pages(). When a process
leads the addition of a struct page to vmemmap (e.g. hot-plug), the page
table update for the newly added vmemmap-based virtual address is updated
first in init_mm's page table and then synchronized later.
If the vmemmap-based virtual address is accessed through the process's
page table before this sync, a page fault will occur. This patch
translates vmemmap-based virtual address to direct-mapped virtual address
and use it, if the current top-level page table is not init_mm's page
table when accessing a vmemmap-based virtual address before this sync.
When a process leads the addition of a struct page to vmemmap
(e.g. hot-plug), the page table update for the newly added vmemmap-based
virtual address is updated first in init_mm's page table and then
synchronized later.
If the vmemmap-based virtual address is accessed through the process's
page table before this sync, a page fault will occur.
This translates vmemmap-based virtual address to direct-mapped virtual
address and use it, if the current top-level page table is not init_mm's
page table when accessing a vmemmap-based virtual address before this sync.
Link: https://lkml.kernel.org/r/20250217114133.400063-2-gwan-gyeong.mun@intel.com Fixes: faf1c0008a33 ("x86/vmemmap: optimize for consecutive sections in partial populated PMDs") Signed-off-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ma Wupeng [Mon, 17 Feb 2025 01:43:29 +0000 (09:43 +0800)]
hwpoison, memory_hotplug: lock folio before unmap hwpoisoned folio
Commit b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to
be offlined) add page poison checks in do_migrate_range in order to make
offline hwpoisoned page possible by introducing isolate_lru_page and
try_to_unmap for hwpoisoned page. However folio lock must be held before
calling try_to_unmap. Add it to fix this problem.
Warning will be produced if folio is not locked during unmap:
Link: https://lkml.kernel.org/r/20250217014329.3610326-4-mawupeng1@huawei.com Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined") Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ma Wupeng [Mon, 17 Feb 2025 01:43:28 +0000 (09:43 +0800)]
mm: memory-hotplug: check folio ref count first in do_migrate_range
If a folio has an increased reference count, folio_try_get() will acquire
it, perform necessary operations, and then release it. In the case of a
poisoned folio without an elevated reference count (which is unlikely for
memory-failure), folio_try_get() will simply bypass it.
Therefore, relocate the folio_try_get() function, responsible for checking
and acquiring this reference count at first.
Link: https://lkml.kernel.org/r/20250217014329.3610326-3-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
unmap_poisoned_folio(): remove shadowed local `mapping', per Miaohe
Link: https://lkml.kernel.org/r/20250219060653.3849083-1-mawupeng1@huawei.com Fixes: 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ma Wupeng [Mon, 17 Feb 2025 01:43:27 +0000 (09:43 +0800)]
mm: memory-failure: update ttu flag inside unmap_poisoned_folio
Patch series "mm: memory_failure: unmap poisoned folio during migrate
properly", v3.
Fix two bugs during folio migration if the folio is poisoned.
This patch (of 3):
Commit 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to
TTU_HWPOISON") introduce TTU_HWPOISON to replace TTU_IGNORE_HWPOISON in
order to stop send SIGBUS signal when accessing an error page after a
memory error on a clean folio. However during page migration, anon folio
must be set with TTU_HWPOISON during unmap_*(). For pagecache we need
some policy just like the one in hwpoison_user_mappings to set this flag.
So move this policy from hwpoison_user_mappings to unmap_poisoned_folio to
handle this warning properly.
Warning will be produced during unamp poison folio with the following log:
Link: https://lkml.kernel.org/r/20250217014329.3610326-1-mawupeng1@huawei.com Link: https://lkml.kernel.org/r/20250217014329.3610326-2-mawupeng1@huawei.com Fixes: 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Mon, 17 Feb 2025 02:49:24 +0000 (10:49 +0800)]
arm: pgtable: fix NULL pointer dereference issue
When update_mmu_cache_range() is called by update_mmu_cache(), the vmf
parameter is NULL, which will cause a NULL pointer dereference issue in
adjust_pte():
Unable to handle kernel NULL pointer dereference at virtual address 00000030 when read
Hardware name: Atmel AT91SAM9
PC is at update_mmu_cache_range+0x1e0/0x278
LR is at pte_offset_map_rw_nolock+0x18/0x2c
Call trace:
update_mmu_cache_range from remove_migration_pte+0x29c/0x2ec
remove_migration_pte from rmap_walk_file+0xcc/0x130
rmap_walk_file from remove_migration_ptes+0x90/0xa4
remove_migration_ptes from migrate_pages_batch+0x6d4/0x858
migrate_pages_batch from migrate_pages+0x188/0x488
migrate_pages from compact_zone+0x56c/0x954
compact_zone from compact_node+0x90/0xf0
compact_node from kcompactd+0x1d4/0x204
kcompactd from kthread+0x120/0x12c
kthread from ret_from_fork+0x14/0x38
Exception stack(0xc0d8bfb0 to 0xc0d8bff8)
To fix it, do not rely on whether 'ptl' is equal to decide whether to hold
the pte lock, but decide it by whether CONFIG_SPLIT_PTE_PTLOCKS is
enabled. In addition, if two vmas map to the same PTE page, there is no
need to hold the pte lock again, otherwise a deadlock will occur. Just
add the need_lock parameter to let adjust_pte() know this information.
Link: https://lkml.kernel.org/r/20250217024924.57996-1-zhengqi.arch@bytedance.com Fixes: fc9c45b71f43 ("arm: adjust_pte() use pte_offset_map_rw_nolock()") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reported-by: Ezra Buehler <ezra.buehler@husqvarnagroup.com> Closes: https://lore.kernel.org/lkml/CAM1KZSmZ2T_riHvay+7cKEFxoPgeVpHkVFTzVVEQ1BO0cLkHEQ@mail.gmail.com/ Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Ezra Buehler <ezra.buehler@husqvarnagroup.com> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Russel King <linux@armlinux.org.uk> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 17 Feb 2025 18:23:04 +0000 (10:23 -0800)]
selftests/damon/damos_quota_goal: handle minimum quota that cannot be further reduced
damos_quota_goal.py selftest see if DAMOS quota goals tuning feature
increases or reduces the effective size quota for given score as expected.
The tuning feature sets the minimum quota size as one byte, so if the
effective size quota is already one, we cannot expect it further be
reduced. However the test is not aware of the edge case, and fails since
it shown no expected change of the effective quota. Handle the case by
updating the failure logic for no change to see if it was the case, and
simply skips to next test input.
Link: https://lkml.kernel.org/r/20250217182304.45215-1-sj@kernel.org Fixes: f1c07c0a1662 ("selftests/damon: add a test for DAMOS quota goal") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202502171423.b28a918d-lkp@intel.com Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org> Cc: <stable@vger.kernel.org> [6.10.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The general approach described in commit e076eaca5906 ("selftests: break
the dependency upon local header files") was taken one step too far here:
it should not have been extended to include the syscall numbers. This is
because doing so would require per-arch support in tools/include/uapi, and
no such support exists.
This revert fixes two separate reports of test failures, from Dave
Hansen[1], and Li Wang[2]. An excerpt of Dave's report:
Linus Torvalds [Sun, 23 Feb 2025 01:32:00 +0000 (17:32 -0800)]
Merge tag 'v6.14-rc3-smb3-client-fix-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fix from Steve French:
- Fix potential null pointer dereference
* tag 'v6.14-rc3-smb3-client-fix-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: Add check for next_buffer in receive_encrypted_standard()
Linus Torvalds [Sat, 22 Feb 2025 18:45:02 +0000 (10:45 -0800)]
Merge tag 'x86-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Fix AVX-VNNI CPU feature dependency bug triggered via the 'noxsave'
boot option
- Fix typos in the SVA documentation
- Add Tony Luck as RDT co-maintainer and remove Fenghua Yu
* tag 'x86-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
docs: arch/x86/sva: Fix two grammar errors under Background and FAQ
x86/cpufeatures: Make AVX-VNNI depend on AVX
MAINTAINERS: Change maintainer for RDT
Linus Torvalds [Sat, 22 Feb 2025 17:30:04 +0000 (09:30 -0800)]
Merge tag 'sched-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq fixes from Ingo Molnar:
- Fix overly spread-out RSEQ concurrency ID allocation pattern that
regressed certain workloads
- Fix RSEQ registration syscall behavior on -EFAULT errors when
CONFIG_DEBUG_RSEQ=y (This debug option is disabled on most
distributions)
* tag 'sched-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: Fix rseq registration with CONFIG_DEBUG_RSEQ
sched: Compact RSEQ concurrency IDs with reduced threads and affinity
Linus Torvalds [Sat, 22 Feb 2025 17:26:12 +0000 (09:26 -0800)]
Merge tag 'perf-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fixes from Ingo Molnar:
"Fix x86 Intel Lion Cove CPU event constraints, and fix uprobes
debug/error printk output pointer-value verbosity"
* tag 'perf-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix event constraints for LNC
uprobes: Don't use %pK through printk
Linus Torvalds [Sat, 22 Feb 2025 17:09:33 +0000 (09:09 -0800)]
Merge tag 's390-6.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix inline asm constraint in cmma_test_essa() to avoid potential ESSA
detection miscompilation
- Fix build failure with CONFIG_GENDWARFKSYMS by disabling purgatory
symbol exports with -D__DISABLE_EXPORTS
- Update defconfigs
* tag 's390-6.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/boot: Fix ESSA detection
s390/purgatory: Use -D__DISABLE_EXPORTS
s390: Update defconfigs
Linus Torvalds [Sat, 22 Feb 2025 17:03:54 +0000 (09:03 -0800)]
Merge tag 'ftrace-v6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"Function graph accounting fixes:
- Fix the manage ops hashes
The function graph registers a "manager ops" and "sub-ops" to
ftrace. The manager ops does not have any callback but calls the
sub-ops callbacks. The manage ops hashes (what is used to tell
ftrace what functions to attach to) is built on the sub-ops it
manages.
There was an error in the way it built the hash. An empty hash
means to attach to all functions. When the manager ops had one
sub-ops it properly copied its hash. But when the manager ops had
more than one sub-ops, it went into a loop to make a set of all
functions it needed to add to the hash. If any of the subops hashes
was empty, that would mean to attach to all functions. The error
was that the first iteration of the loop passed in an empty hash to
start with in order to add the other hashes. That starting hash was
mistaken as to attach to all functions. This made the manage ops
attach to all functions whenever it had two or more sub-ops, even
if each sub-op was attached to only a single function.
- Do not add duplicate entries to the manager ops hash
If two or more subops hashes trace the same function, an entry for
that function will be added to the manager ops for each subops.
This causes waste and extra overhead.
Fprobe accounting fixes:
- Remove last function from fprobe hash
Fprobes has a ftrace hash to manage which functions an fprobe is
attached to. It also has a counter of how many fprobes are
attached. When the last fprobe is removed, it unregisters the
fprobe from ftrace but does not remove the functions the last
fprobe was attached to from the hash. This leaves the old functions
attached. When a new fprobe is added, the fprobe infrastructure
attaches to not only the functions of the new fprobe, but also to
the functions of the last fprobe.
- Fix accounting of the fprobe counter
When a fprobe is added, it updates a counter. If the counter goes
from zero to one, it attaches its ops to ftrace. When an fprobe is
removed, the counter is decremented. If the counter goes from 1 to
zero, it removes the fprobes ops from ftrace.
There was an issue where if two fprobes trace the same function,
the addition of each fprobe would increment the counter. But when
removing the first of the fprobes, it would notice that another
fprobe is still attached to one of its functions no it does not
remove the functions from the ftrace ops.
But it also did not decrement the counter, so when the last fprobe
is removed, the counter is still one. This leaves the fprobes
callback still registered with ftrace and it being called by the
functions defined by the fprobes ops hash. Worse yet, because all
the functions from the fprobe ops hash have been removed, that
tells ftrace that it wants to trace all functions.
Thus, this puts the state of the system where every function is
calling the fprobe callback handler (which does nothing as there
are no registered fprobes), but this causes a good 13% slow down of
the entire system.
Other updates:
- Add a selftest to test the above issues to prevent regressions.
- Fix preempt count accounting in function tracing
Better recursion protection was added to function tracing which
added another layer of preempt disable. As the preempt_count gets
traced in the event, it needs to subtract the amount of preempt
disabling the tracer does to record what the preempt_count was when
the trace was triggered.
- Fix memory leak in output of set_event
A variable is passed by the seq_file functions in the location that
is set by the return of the next() function. The start() function
allocates it and the stop() function frees it. But when the last
item is found, the next() returns NULL which leaks the data that
was allocated in start(). The m->private is used for something
else, so have next() free the data when it returns NULL, as stop()
will then just receive NULL in that case"
* tag 'ftrace-v6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix memory leak when reading set_event file
ftrace: Correct preemption accounting for function tracing.
selftests/ftrace: Update fprobe test to check enabled_functions file
fprobe: Fix accounting of when to unregister from function graph
fprobe: Always unregister fgraph function from ops
ftrace: Do not add duplicate entries in subops manager ops
ftrace: Fix accounting of adding subops to a manager ops
drivers/i2c/i2c-core-base.c: In function ‘i2c_detect.isra’:
drivers/i2c/i2c-core-base.c:2544:1: warning: the frame size of 1312 bytes is larger than 1024 bytes [-Wframe-larger-than=]
2544 | }
| ^
Fix this by allocating the temporary client structure dynamically, as it
is a rather large structure (1216 bytes, depending on kernel config).
This is basically a revert of the to-be-fixed commit with some
checkpatch improvements.
Fixes: 735668f8e5c9 ("i2c: core: Allocate temp client on the stack in i2c_detect") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Su Hui <suhui@nfschina.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net>
[wsa: updated commit message, merged tags from similar patch] Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Linus Torvalds [Fri, 21 Feb 2025 21:16:01 +0000 (13:16 -0800)]
Merge tag 'soc-fixes-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"Two people stepped up as platform co-maintainers: Andrew Jeffery for
ASpeed and Janne Grunau for Apple.
The rockchip platform gets 9 small fixes for devicetree files,
addressing both compile-time warnings and board specific bugs.
One bugfix for the optee firmware driver addresses a reboot-time hang.
Two drivers need improved Kconfig dependencies to allow wider compile-
testing while hiding the drivers on platforms that can't use them.
ARM SCMI and loongson-guts drivers get minor bugfixes"
* tag 'soc-fixes-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
soc: loongson: loongson2_guts: Add check for devm_kstrdup()
tee: optee: Fix supplicant wait loop
platform: cznic: CZNIC_PLATFORMS should depend on ARCH_MVEBU
firmware: imx: IMX_SCMI_MISC_DRV should depend on ARCH_MXC
MAINTAINERS: arm: apple: Add Janne as maintainer
MAINTAINERS: Mark Andrew as M: for ASPEED MACHINE SUPPORT
firmware: arm_scmi: imx: Correct tx size of scmi_imx_misc_ctrl_set
arm64: dts: rockchip: adjust SMMU interrupt type on rk3588
arm64: dts: rockchip: disable IOMMU when running rk3588 in PCIe endpoint mode
dt-bindings: rockchip: pmu: Ensure all properties are defined
arm64: defconfig: Enable TISCI Interrupt Router and Aggregator
arm64: dts: rockchip: Fix lcdpwr_en pin for Cool Pi GenBook
arm64: dts: rockchip: fix fixed-regulator renames on rk3399-gru devices
arm64: dts: rockchip: Disable DMA for uart5 on px30-ringneck
arm64: dts: rockchip: Move uart5 pin configuration to px30 ringneck SoM
arm64: dts: rockchip: change eth phy mode to rgmii-id for orangepi r1 plus lts
arm64: dts: rockchip: Fix broken tsadc pinctrl names for rk3588
Linus Torvalds [Fri, 21 Feb 2025 21:10:22 +0000 (13:10 -0800)]
Merge tag 'drm-fixes-2025-02-22' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly drm fixes pull request, lots of small things all over, msm has
a bunch of things but all very small, xe, i915, a fix for the cgroup
dmem controller.
core:
- remove MAINTAINERS entry
cgroup/dmem:
- use correct function for pool descendants
panel:
- fix signal polarity issue jd9365da-h3
nouveau:
- folio handling fix
- config fix
amdxdna:
- fix missing header
xe:
- Fix error handling in xe_irq_install
- Fix devcoredump format
i915:
- Use spin_lock_irqsave() in interruptible context on guc submission
- Fixes on DDI and TRANS programming
- Make sure all planes in use by the joiner have their crtc included
- Fix 128b/132b modeset issues
msm:
- More catalog fixes:
- to skip watchdog programming through top block if its not
present
- fix the setting of WB mask to ensure the WB input control is
programmed correctly through ping-pong
- drop lm_pair for sm6150 as that chipset does not have any
3dmerge block
- Fix the mode validation logic for DP/eDP to account for widebus
(2ppc) to allow high clock resolutions
- Fix to disable dither during encoder disable as otherwise this
was causing kms_writeback failure due to resource sharing
between WB and DSI paths as DSI uses dither but WB does not
- Fixes for virtual planes, namely to drop extraneous return and
fix uninitialized variables
- Fix to avoid spill-over of DSC encoder block bits when
programming the bits-per-component
- Fixes in the DSI PHY to protect against concurrent access of
PHY_CMN_CLK_CFG regs between clock and display drivers
- Core/GPU:
- Fix non-blocking fence wait incorrectly rounding up to 1 jiffy
timeout
- Only print GMU fw version once, instead of each time the GPU
resumes"
* tag 'drm-fixes-2025-02-22' of https://gitlab.freedesktop.org/drm/kernel: (28 commits)
drm/i915/dp: Fix disabling the transcoder function in 128b/132b mode
drm/i915/dp: Fix error handling during 128b/132b link training
accel/amdxdna: Add missing include linux/slab.h
MAINTAINERS: Remove myself
drm/nouveau/pmu: Fix gp10b firmware guard
cgroup/dmem: Don't open-code css_for_each_descendant_pre
drm/xe/guc: Fix size_t print format
drm/xe: Make GUC binaries dump consistent with other binaries in devcoredump
drm/i915: Make sure all planes in use by the joiner have their crtc included
drm/i915/ddi: Fix HDMI port width programming in DDI_BUF_CTL
drm/i915/dsi: Use TRANS_DDI_FUNC_CTL's own port width macro
drm/xe: Fix error handling in xe_irq_install()
drm/i915/gt: Use spin_lock_irqsave() in interruptible context
drm/msm/dsi/phy: Do not overwite PHY_CMN_CLK_CFG1 when choosing bitclk source
drm/msm/dsi/phy: Protect PHY_CMN_CLK_CFG1 against clock driver
drm/msm/dsi/phy: Protect PHY_CMN_CLK_CFG0 updated from driver side
drm/msm/dpu: Drop extraneous return in dpu_crtc_reassign_planes()
drm/msm/dpu: Don't leak bits_per_component into random DSC_ENC fields
drm/msm/dpu: Disable dither in phys encoder cleanup
drm/msm/dpu: Fix uninitialized variable
...
Linus Torvalds [Fri, 21 Feb 2025 17:36:28 +0000 (09:36 -0800)]
Merge tag 'block-6.14-20250221' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- FC controller state check fixes (Daniel)
- PCI Endpoint fixes (Damien)
- TCP connection failure fixe (Caleb)
- TCP handling C2HTermReq PDU (Maurizio)
- RDMA queue state check (Ruozhu)
- Apple controller fixes (Hector)
- Target crash on disbaled namespace (Hannes)
- MD pull request via Yu:
- Fix queue limits error handling for raid0, raid1 and raid10
- Fix for a NULL pointer deref in request data mapping
- Code cleanup for request merging
* tag 'block-6.14-20250221' of git://git.kernel.dk/linux:
nvme: only allow entering LIVE from CONNECTING state
nvme-fc: rely on state transitions to handle connectivity loss
apple-nvme: Support coprocessors left idle
apple-nvme: Release power domains when probe fails
nvmet: Use enum definitions instead of hardcoded values
nvme: Cleanup the definition of the controller config register fields
nvme/ioctl: add missing space in err message
nvme-tcp: fix connect failure on receiving partial ICResp PDU
nvme: tcp: Fix compilation warning with W=1
nvmet: pci-epf: Avoid RCU stalls under heavy workload
nvmet: pci-epf: Do not uselessly write the CSTS register
nvmet: pci-epf: Correctly initialize CSTS when enabling the controller
nvmet-rdma: recheck queue state is LIVE in state lock in recv done
nvmet: Fix crash when a namespace is disabled
nvme-tcp: add basic support for the C2HTermReq PDU
nvme-pci: quirk Acer FA100 for non-uniqueue identifiers
block: fix NULL pointer dereferenced within __blk_rq_map_sg
block/merge: remove unnecessary min() with UINT_MAX
md/raid*: Fix the set_queue_limits implementations
Linus Torvalds [Fri, 21 Feb 2025 17:17:56 +0000 (09:17 -0800)]
Merge tag 'io_uring-6.14-20250221' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Series fixing an issue with multishot read on pollable files that may
return -EIOCBQUEUED from ->read_iter(). Four small patches for that,
the first one deliberately done in such a way that it'd be easy to
backport
- Remove some dead constant definitions
- Use array_index_nospec() for opcode indexing
- Work-around for worker creation retries in the presence of signals
* tag 'io_uring-6.14-20250221' of git://git.kernel.dk/linux:
io_uring/rw: clean up mshot forced sync mode
io_uring/rw: move ki_complete init into prep
io_uring/rw: don't directly use ki_complete
io_uring/rw: forbid multishot async reads
io_uring/rsrc: remove unused constants
io_uring: fix spelling error in uapi io_uring.h
io_uring: prevent opcode speculation
io-wq: backoff when retrying worker creation
Linus Torvalds [Fri, 21 Feb 2025 17:11:25 +0000 (09:11 -0800)]
Merge tag 'acpi-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Fix a memory leak in the ACPI platform_profile driver (Kurt Borja)"
* tag 'acpi-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: platform_profile: Fix memory leak in profile_class_is_visible()
Linus Torvalds [Fri, 21 Feb 2025 17:07:04 +0000 (09:07 -0800)]
Merge tag 'mtd/fixes-for-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull mtd fixes from Miquel Raynal:
"The two most important fixes in this list are probably the SST write
failure and the Qcom raw NAND controller probe failure which are due
to some refactoring, otherwise there has been a series of misc fixes
on the Cadence raw NAND controller driver and especially on the DMA
side"
* tag 'mtd/fixes-for-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: rawnand: cadence: fix unchecked dereference
mtd: spi-nor: sst: Fix SST write failure
dt-bindings: mtd: cadence: document required clock-names
mtd: rawnand: qcom: fix broken config in qcom_param_page_type_exec
mtd: rawnand: cadence: fix incorrect device in dma_unmap_single
mtd: rawnand: cadence: use dma_map_resource for sdma address
mtd: rawnand: cadence: fix error code in cadence_nand_init()
Linus Torvalds [Fri, 21 Feb 2025 16:59:27 +0000 (08:59 -0800)]
Merge tag 'gpio-fixes-for-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"There are two fixes for GPIO core: one adds missing retval checks to
older code, while the second adds SRCU synchronization to legs in code
that were missed during the big rework a few cycles back. There's also
one small driver fix:
- check the return value of the get_direction() callback in struct
gpio_chip
- protect the multi-line get/set legs in GPIO core with SRCU
- fix a race condition in gpio-vf610"
* tag 'gpio-fixes-for-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: don't bail out if get_direction() fails in gpiochip_add_data()
gpiolib: protect gpio_chip with SRCU in array_info paths in multi get/set
gpio: vf610: add locking to gpio direction functions
gpiolib: check the return value of gpio_chip::get_direction()
The root cause is that s_next() returns NULL when nothing is found.
This results in s_stop() attempting to free a NULL pointer because its
parameter is NULL.
Fix the issue by freeing the memory appropriately when s_next() fails
to find anything.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250220031528.7373-1-ahuang12@lenovo.com Fixes: b355247df104 ("tracing: Cache ":mod:" events for modules not loaded yet") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
ftrace: Correct preemption accounting for function tracing.
The function tracer should record the preemption level at the point when
the function is invoked. If the tracing subsystem decrement the
preemption counter it needs to correct this before feeding the data into
the trace buffer. This was broken in the commit cited below while
shifting the preempt-disabled section.
Use tracing_gen_ctx_dec() which properly subtracts one from the
preemption counter on a preemptible kernel.
Cc: stable@vger.kernel.org Cc: Wander Lairson Costa <wander@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/20250220140749.pfw8qoNZ@linutronix.de Fixes: ce5e48036c9e7 ("ftrace: disable preemption when recursion locked") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt [Thu, 20 Feb 2025 20:20:14 +0000 (15:20 -0500)]
selftests/ftrace: Update fprobe test to check enabled_functions file
A few bugs were found in the fprobe accounting logic along with it using
the function graph infrastructure. Update the fprobe selftest to catch
those bugs in case they or something similar shows up in the future.
The test now checks the enabled_functions file which shows all the
functions attached to ftrace or fgraph. When enabling a fprobe, make sure
that its corresponding function is also added to that file. Also add two
more fprobes to enable to make sure that the fprobe logic works properly
with multiple probes.
Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250220202055.733001756@goodmis.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt [Thu, 20 Feb 2025 20:20:13 +0000 (15:20 -0500)]
fprobe: Fix accounting of when to unregister from function graph
When adding a new fprobe, it will update the function hash to the
functions the fprobe is attached to and register with function graph to
have it call the registered functions. The fprobe_graph_active variable
keeps track of the number of fprobes that are using function graph.
If two fprobes attach to the same function, it increments the
fprobe_graph_active for each of them. But when they are removed, the first
fprobe to be removed will see that the function it is attached to is also
used by another fprobe and it will not remove that function from
function_graph. The logic will skip decrementing the fprobe_graph_active
variable.
This causes the fprobe_graph_active variable to not go to zero when all
fprobes are removed, and in doing so it does not unregister from
function graph. As the fgraph ops hash will now be empty, and an empty
filter hash means all functions are enabled, this triggers function graph
to add a callback to the fprobe infrastructure for every function!
Steven Rostedt [Thu, 20 Feb 2025 20:20:12 +0000 (15:20 -0500)]
fprobe: Always unregister fgraph function from ops
When the last fprobe is removed, it calls unregister_ftrace_graph() to
remove the graph_ops from function graph. The issue is when it does so, it
calls return before removing the function from its graph ops via
ftrace_set_filter_ips(). This leaves the last function lingering in the
fprobe's fgraph ops and if a probe is added it also enables that last
function (even though the callback will just drop it, it does add unneeded
overhead to make that call).
The above enabled a fprobe on kernel_clone, and then on schedule_timeout.
The content of the enabled_functions shows the functions that have a
callback attached to them. The fprobe attached to those functions
properly. Then the fprobes were cleared, and enabled_functions was empty
after that. But after adding a fprobe on kmem_cache_free, the
enabled_functions shows that the schedule_timeout was attached again. This
is because it was still left in the fprobe ops that is used to tell
function graph what functions it wants callbacks from.
Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250220202055.393254452@goodmis.org Fixes: 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer") Tested-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt [Thu, 20 Feb 2025 20:20:11 +0000 (15:20 -0500)]
ftrace: Do not add duplicate entries in subops manager ops
Check if a function is already in the manager ops of a subops. A manager
ops contains multiple subops, and if two or more subops are tracing the
same function, the manager ops only needs a single entry in its hash.
Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250220202055.226762894@goodmis.org Fixes: 4f554e955614f ("ftrace: Add ftrace_set_filter_ips function") Tested-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Steven Rostedt [Thu, 20 Feb 2025 20:20:10 +0000 (15:20 -0500)]
ftrace: Fix accounting of adding subops to a manager ops
Function graph uses a subops and manager ops mechanism to attach to
ftrace. The manager ops connects to ftrace and the functions it connects
to is defined by a list of subops that it manages.
The function hash that defines what the above ops attaches to limits the
functions to attach if the hash has any content. If the hash is empty, it
means to trace all functions.
The creation of the manager ops hash is done by iterating over all the
subops hashes. If any of the subops hashes is empty, it means that the
manager ops hash must trace all functions as well.
The issue is in the creation of the manager ops. When a second subops is
attached, a new hash is created by starting it as NULL and adding the
subops one at a time. But the NULL ops is mistaken as an empty hash, and
once an empty hash is found, it stops the loop of subops and just enables
all functions.
Fix this by initializing the new hash to NULL and if the hash is NULL do
not treat it as an empty hash but instead allocate by copying the content
of the first sub ops. Then on subsequent iterations, the new hash will not
be NULL, but the content of the previous subops. If that first subops
attached to all functions, then new hash may assume that the manager ops
also needs to attach to all functions.
Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250220202055.060300046@goodmis.org Fixes: 5fccc7552ccbc ("ftrace: Add subops logic to allow one ops to manage many") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Michael Jeanson [Wed, 19 Feb 2025 20:53:26 +0000 (15:53 -0500)]
rseq: Fix rseq registration with CONFIG_DEBUG_RSEQ
With CONFIG_DEBUG_RSEQ=y, at rseq registration the read-only fields are
copied from user-space, if this copy fails the syscall returns -EFAULT
and the registration should not be activated - but it erroneously is.
Move the activation of the registration after the copy of the fields to
fix this bug.
* patches from https://lore.kernel.org/r/20250218120209.88093-1-jefflexu@linux.alibaba.com:
mm/truncate: don't skip dirty page in folio_unmap_invalidate()
mm/filemap: fix miscalculated file range for filemap_fdatawrite_range_kick()
Stephan Gerhold [Tue, 18 Feb 2025 15:59:18 +0000 (16:59 +0100)]
irqchip/qcom-pdc: Workaround hardware register bug on X1E80100
On X1E80100, there is a hardware bug in the register logic of the
IRQ_ENABLE_BANK register: While read accesses work on the normal address,
all write accesses must be made to a shifted address. Without a workaround
for this, the wrong interrupt gets enabled in the PDC and it is impossible
to wakeup from deep suspend (CX collapse). This has not caused problems so
far, because the deep suspend state was not enabled. A workaround is
required now since work is ongoing to fix this.
The PDC has multiple "DRV" regions, each one has a size of 0x10000 and
provides the same set of registers for a particular client in the system.
Linux is one the clients and uses DRV region 2 on X1E. Each "bank" inside
the DRV region consists of 32 interrupt pins that can be enabled using the
IRQ_ENABLE_BANK register:
IRQ_ENABLE_BANK[bank] = base + IRQ_ENABLE_BANK + bank * sizeof(u32)
On X1E, this works as intended for read access. However, write access to
most banks is shifted by 2:
Introduce a workaround for the bug by matching the qcom,x1e80100-pdc
compatible and apply the offsets as shown above:
- Bank 0...1: previous DRV region, bank += 3
- Bank 1...4: our DRV region, bank -= 2
- Bank 5: our DRV region, no fixup required
The PDC node in the device tree only describes the DRV region for the Linux
client, but the workaround also requires to map parts of the previous DRV
region to issue writes there. To maintain compatibility with old device
trees, obtain the base address of the preceeding region by applying the
-0x10000 offset. Note that this is also more correct from a conceptual
point of view:
It does not really make use of the other region; it just issues shifted
writes that end up in the registers of the Linux associated DRV region 2.
Qu Wenruo [Tue, 18 Feb 2025 22:36:33 +0000 (09:06 +1030)]
btrfs: fix data overwriting bug during buffered write when block size < page size
[BUG]
When running generic/418 with a btrfs whose block size < page size
(subpage cases), it always fails.
And the following minimal reproducer is more than enough to trigger it
reliably:
workload()
{
mkfs.btrfs -s 4k -f $dev > /dev/null
dmesg -C
mount $dev $mnt
$fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
ret=$?
umount $mnt
stop_trace
if [ $ret -ne 0 ]; then
fail
fi
}
for (( i = 0; i < 1024; i++)); do
echo "=== $i/$runtime ==="
workload
done
[CAUSE]
With extra trace printk added to the following functions:
- btrfs_buffered_write()
* Which folio is touched
* The file offset (start) where the buffered write is at
* How many bytes are copied
* The content of the write (the first 2 bytes)
- submit_one_sector()
* Which folio is touched
* The position inside the folio
* The content of the page cache (the first 2 bytes)
- pagecache_isize_extended()
* The parameters of the function itself
* The parameters of the folio_zero_range()
The tool dio-invalidate-cache will start 3 threads, each doing a buffered
write with 0x01 at offset 0, 4096 and 8192, do a fsync, then do a direct read,
and compare the read buffer with the write buffer.
Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into
the page cache.
But at submit_one_sector(), at file offset 4096, the content is zeroed
out, by pagecache_isize_extended().
The race happens like this:
Thread A is writing into range [4K, 8K).
Thread B is writing into range [8K, 12k).
Thread A | Thread B
-------------------------------------+------------------------------------
btrfs_buffered_write() | btrfs_buffered_write()
|- old_isize = 4K; | |- old_isize = 4096;
|- btrfs_inode_lock() | |
|- write into folio range [4K, 8K) | |
|- pagecache_isize_extended() | |
| extend isize from 4096 to 8192 | |
| no folio_zero_range() called | |
|- btrfs_inode_lock() | |
| |- btrfs_inode_lock()
| |- write into folio range [8K, 12K)
| |- pagecache_isize_extended()
| | calling folio_zero_range(4K, 8K)
| | This is caused by the old_isize is
| | grabbed too early, without any
| | inode lock.
| |- btrfs_inode_unlock()
The @old_isize is grabbed without inode lock, causing race between two
buffered write threads and making pagecache_isize_extended() to zero
range which is still containing cached data.
And this is only affecting subpage btrfs, because for regular blocksize
== page size case, the function pagecache_isize_extended() will do
nothing if the block size >= page size.
[FIX]
Grab the old i_size while holding the inode lock.
This means each buffered write thread will have a stable view of the
old inode size, thus avoid the above race.
CC: stable@vger.kernel.org # 5.15+ Fixes: 5e8b9ef30392 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 17 Feb 2025 09:46:39 +0000 (20:16 +1030)]
btrfs: output an error message if btrfs failed to find the seed fsid
[BUG]
If btrfs failed to locate the seed device for whatever reason, mounting
the sprouted device will fail without any meaning error message:
# mkfs.btrfs -f /dev/test/scratch1
# btrfstune -S1 /dev/test/scratch1
# mount /dev/test/scratch1 /mnt/btrfs
# btrfs dev add -f /dev/test/scratch2 /mnt/btrfs
# umount /mnt/btrfs
# btrfs dev scan -u
# btrfs mount /dev/test/scratch2 /mnt/btrfs
mount: /mnt/btrfs: fsconfig system call failed: No such file or directory.
dmesg(1) may have more information after failed mount system call.
# dmesg -t | tail -n6
BTRFS info (device dm-5): first mount of filesystem 64252ded-5953-4868-b962-cea48f7ac4ea
BTRFS info (device dm-5): using crc32c (crc32c-generic) checksum algorithm
BTRFS info (device dm-5): using free-space-tree
BTRFS error (device dm-5): failed to read chunk tree: -2
BTRFS error (device dm-5): open_ctree failed: -2
[CAUSE]
The failure to mount is pretty straight forward, just unable to find the
seed device and its fsid, caused by `btrfs dev scan -u`.
But the lack of any useful info is a problem.
[FIX]
Just add an extra error message in open_seed_devices() to indicate the
error.
Now the error message would look like this:
BTRFS info (device dm-4): first mount of filesystem 7769223d-4db1-4e4c-ac29-0a96f53576ab
BTRFS info (device dm-4): using crc32c (crc32c-generic) checksum algorithm
BTRFS info (device dm-4): using free-space-tree
BTRFS error (device dm-4): failed to find fsid e87c12e6-584b-4e98-8b88-962c33a619ff when attempting to open seed devices
BTRFS error (device dm-4): failed to read chunk tree: -2
BTRFS error (device dm-4): open_ctree failed: -2
Link: https://github.com/kdave/btrfs-progs/issues/959 Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Sat, 15 Feb 2025 11:11:29 +0000 (11:11 +0000)]
btrfs: do regular iput instead of delayed iput during extent map shrinking
The extent map shrinker now runs in the system unbound workqueue and no
longer in kswapd context so it can directly do an iput() on inodes even
if that blocks or needs to acquire any lock (we aren't holding any locks
when requesting the delayed iput from the shrinker). So we don't need to
add a delayed iput, wake up the cleaner and delegate the iput() to the
cleaner, which also adds extra contention on the spinlock that protects
the delayed iputs list.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/ CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Sat, 15 Feb 2025 11:04:15 +0000 (11:04 +0000)]
btrfs: skip inodes without loaded extent maps when shrinking extent maps
If there are inodes that don't have any loaded extent maps, we end up
grabbing a reference on them and later adding a delayed iput, which wakes
up the cleaner and makes it do unnecessary work. This is common when for
example the inodes were open only to run stat(2) or all their extent maps
were already released through the folio release callback
(btrfs_release_folio()) or released by a previous run of the shrinker, or
directories which never have extent maps.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/ CC: stable@vger.kernel.org # 6.13+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Sat, 15 Feb 2025 11:36:15 +0000 (11:36 +0000)]
btrfs: fix use-after-free on inode when scanning root during em shrinking
At btrfs_scan_root() we are accessing the inode's root (and fs_info) in a
call to btrfs_fs_closing() after we have scheduled the inode for a delayed
iput, and that can result in a use-after-free on the inode in case the
cleaner kthread does the iput before we dereference the inode in the call
to btrfs_fs_closing().
Fix this by using the fs_info stored already in a local variable instead
of doing inode->root->fs_info.
Fixes: 102044384056 ("btrfs: make the extent map shrinker run asynchronously as a work queue job") CC: stable@vger.kernel.org # 6.13+ Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Dave Airlie [Fri, 21 Feb 2025 00:50:28 +0000 (10:50 +1000)]
Merge tag 'drm-msm-fixes-2025-02-20' of https://gitlab.freedesktop.org/drm/msm into drm-fixes
Fixes for v6.14-rc4
Display:
* More catalog fixes:
- to skip watchdog programming through top block if its not present
- fix the setting of WB mask to ensure the WB input control is programmed
correctly through ping-pong
- drop lm_pair for sm6150 as that chipset does not have any 3dmerge block
* Fix the mode validation logic for DP/eDP to account for widebus (2ppc)
to allow high clock resolutions
* Fix to disable dither during encoder disable as otherwise this was
causing kms_writeback failure due to resource sharing between
* WB and DSI paths as DSI uses dither but WB does not
* Fixes for virtual planes, namely to drop extraneous return and fix
uninitialized variables
* Fix to avoid spill-over of DSC encoder block bits when programming
the bits-per-component
* Fixes in the DSI PHY to protect against concurrent access of
PHY_CMN_CLK_CFG regs between clock and display drivers
Core/GPU:
* Fix non-blocking fence wait incorrectly rounding up to 1 jiffy timeout
* Only print GMU fw version once, instead of each time the GPU resumes
Dave Airlie [Fri, 21 Feb 2025 00:44:53 +0000 (10:44 +1000)]
Merge tag 'drm-intel-fixes-2025-02-20' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes
- Use spin_lock_irqsave() in interruptible context on guc submission (Krzysztof)
- Fixes on DDI and TRANS programming (Imre)
- Make sure all planes in use by the joiner have their crtc included (Ville)
- Fix 128b/132b modeset issues (Imre)
Jens Axboe [Fri, 21 Feb 2025 00:43:59 +0000 (17:43 -0700)]
Merge tag 'nvme-6.14-2025-02-20' of git://git.infradead.org/nvme into block-6.14
Pull NVMe fixes from Keith:
"nvme fixes for Linux 6.14
- FC controller state check fixes (Daniel)
- PCI Endpoint fixes (Damien)
- TCP connection failure fixe (Caleb)
- TCP handling C2HTermReq PDU (Maurizio)
- RDMA queue state check (Ruozhu)
- Apple controller fixes (Hector)
- Target crash on disbaled namespace (Hannes)"
* tag 'nvme-6.14-2025-02-20' of git://git.infradead.org/nvme:
nvme: only allow entering LIVE from CONNECTING state
nvme-fc: rely on state transitions to handle connectivity loss
apple-nvme: Support coprocessors left idle
apple-nvme: Release power domains when probe fails
nvmet: Use enum definitions instead of hardcoded values
nvme: Cleanup the definition of the controller config register fields
nvme/ioctl: add missing space in err message
nvme-tcp: fix connect failure on receiving partial ICResp PDU
nvme: tcp: Fix compilation warning with W=1
nvmet: pci-epf: Avoid RCU stalls under heavy workload
nvmet: pci-epf: Do not uselessly write the CSTS register
nvmet: pci-epf: Correctly initialize CSTS when enabling the controller
nvmet-rdma: recheck queue state is LIVE in state lock in recv done
nvmet: Fix crash when a namespace is disabled
nvme-tcp: add basic support for the C2HTermReq PDU
nvme-pci: quirk Acer FA100 for non-uniqueue identifiers
Linus Torvalds [Thu, 20 Feb 2025 23:37:17 +0000 (15:37 -0800)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull BPF fixes from Daniel Borkmann:
- Fix a soft-lockup in BPF arena_map_free on 64k page size kernels
(Alan Maguire)
- Fix a missing allocation failure check in BPF verifier's
acquire_lock_state (Kumar Kartikeya Dwivedi)
- Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb
to the raw_tp_null_args set (Kuniyuki Iwashima)
- Fix a deadlock when freeing BPF cgroup storage (Abel Wu)
- Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex
(Andrii Nakryiko)
- Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is
accessing skb data not containing an Ethernet header (Shigeru
Yoshida)
- Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai)
- Several BPF sockmap fixes to address incorrect TCP copied_seq
calculations, which prevented correct data reads from recv(2) in user
space (Jiayuan Chen)
- Two fixes for BPF map lookup nullness elision (Daniel Xu)
- Fix a NULL-pointer dereference from vmlinux BTF lookup in
bpf_sk_storage_tracing_allowed (Jared Kangas)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests: bpf: test batch lookup on array of maps with holes
bpf: skip non exist keys in generic_map_lookup_batch
bpf: Handle allocation failure in acquire_lock_state
bpf: verifier: Disambiguate get_constant_map_key() errors
bpf: selftests: Test constant key extraction on irrelevant maps
bpf: verifier: Do not extract constant map keys for irrelevant maps
bpf: Fix softlockup in arena_map_free on 64k page kernel
net: Add rx_skb of kfree_skb to raw_tp_null_args[].
bpf: Fix deadlock when freeing cgroup storage
selftests/bpf: Add strparser test for bpf
selftests/bpf: Fix invalid flag of recv()
bpf: Disable non stream socket for strparser
bpf: Fix wrong copied_seq calculation
strparser: Add read_sock callback
bpf: avoid holding freeze_mutex during mmap operation
bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic
selftests/bpf: Adjust data size to have ETH_HLEN
bpf, test_run: Fix use-after-free issue in eth_skb_pkt_type()
bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed
Dave Airlie [Thu, 20 Feb 2025 23:16:18 +0000 (09:16 +1000)]
Merge tag 'drm-misc-fixes-2025-02-20' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
An reset signal polarity fix for the jd9365da-h3 panel, a folio handling
fix and config fix in nouveau, a dmem cgroup descendant pool handling
fix, and a missing header for amdxdna.
Arnd Bergmann [Thu, 20 Feb 2025 21:28:28 +0000 (22:28 +0100)]
Merge tag 'scmi-fix-6.14' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes
Arm SCMI fix for v6.14
Just a single fix to address the incorrect size of the Tx buffer in the
function scmi_imx_misc_ctrl_set() which is part of NXP/i.MX SCMI vendor
extensions.
* tag 'scmi-fix-6.14' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_scmi: imx: Correct tx size of scmi_imx_misc_ctrl_set
- tcp:
- adjust rcvq_space after updating scaling ratio
- drop secpath at the same time as we currently drop dst
- eth: gtp: suppress list corruption splat in gtp_net_exit_batch_rtnl().
Previous releases - always broken:
- vsock:
- fix variables initialization during resuming
- for connectible sockets allow only connected
- eth:
- geneve: fix use-after-free in geneve_find_dev()
- ibmvnic: don't reference skb after sending to VIOS"
* tag 'net-6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
Revert "net: skb: introduce and use a single page frag cache"
net: allow small head cache usage with large MAX_SKB_FRAGS values
nfp: bpf: Add check for nfp_app_ctrl_msg_alloc()
tcp: drop secpath at the same time as we currently drop dst
net: axienet: Set mac_managed_pm
arp: switch to dev_getbyhwaddr() in arp_req_set_public()
net: Add non-RCU dev_getbyhwaddr() helper
sctp: Fix undefined behavior in left shift operation
selftests/bpf: Add a specific dst port matching
flow_dissector: Fix port range key handling in BPF conversion
selftests/net/forwarding: Add a test case for tc-flower of mixed port and port-range
flow_dissector: Fix handling of mixed port and port-range keys
geneve: Suppress list corruption splat in geneve_destroy_tunnels().
gtp: Suppress list corruption splat in gtp_net_exit_batch_rtnl().
dev: Use rtnl_net_dev_lock() in unregister_netdev().
net: Fix dev_net(dev) race in unregister_netdevice_notifier_dev_net().
net: Add net_passive_inc() and net_passive_dec().
net: pse-pd: pd692x0: Fix power limit retrieval
MAINTAINERS: trim the GVE entry
gve: set xdp redirect target only when it is available
...
Haoxiang Li [Mon, 17 Feb 2025 07:20:38 +0000 (15:20 +0800)]
smb: client: Add check for next_buffer in receive_encrypted_standard()
Add check for the return value of cifs_buf_get() and cifs_small_buf_get()
in receive_encrypted_standard() to prevent null pointer dereference.
Fixes: eec04ea11969 ("smb: client: fix OOB in receive_encrypted_standard()") Cc: stable@vger.kernel.org Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Signed-off-by: Steve French <stfrench@microsoft.com>
Daniel Wagner [Fri, 14 Feb 2025 08:02:03 +0000 (09:02 +0100)]
nvme: only allow entering LIVE from CONNECTING state
The fabric transports and also the PCI transport are not entering the
LIVE state from NEW or RESETTING. This makes the state machine more
restrictive and allows to catch not supported state transitions, e.g.
directly switching from RESETTING to LIVE.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Daniel Wagner [Fri, 14 Feb 2025 08:02:04 +0000 (09:02 +0100)]
nvme-fc: rely on state transitions to handle connectivity loss
It's not possible to call nvme_state_ctrl_state with holding a spin
lock, because nvme_state_ctrl_state calls cancel_delayed_work_sync
when fastfail is enabled.
Instead syncing the ASSOC_FLAG and state transitions using a lock, it's
possible to only rely on the state machine transitions. That means
nvme_fc_ctrl_connectivity_loss should unconditionally call
nvme_reset_ctrl which avoids the read race on the ctrl state variable.
Actually, it's not necessary to test in which state the ctrl is, the
reset work will only scheduled when the state machine is in LIVE state.
In nvme_fc_create_association, the LIVE state can only be entered if it
was previously CONNECTING. If this is not possible then the reset
handler got triggered. Thus just error out here.
Fixes: ee59e3820ca9 ("nvme-fc: do not ignore connectivity loss during connecting") Closes: https://lore.kernel.org/all/denqwui6sl5erqmz2gvrwueyxakl5txzbbiu3fgebryzrfxunm@iwxuthct377m/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Linus Torvalds [Thu, 20 Feb 2025 16:59:00 +0000 (08:59 -0800)]
Merge tag 'v6.14-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Fix for chmod regression
- Two reparse point related fixes
- One minor cleanup (for GCC 14 compiles)
- Fix for SMB3.1.1 POSIX Extensions reporting incorrect file type
* tag 'v6.14-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Treat unhandled directory name surrogate reparse points as mount directory nodes
cifs: Throw -EOPNOTSUPP error on unsupported reparse point type from parse_reparse_point()
smb311: failure to open files of length 1040 when mounting with SMB3.1.1 POSIX extensions
smb: client, common: Avoid multiple -Wflex-array-member-not-at-end warnings
smb: client: fix chmod(2) regression with ATTR_READONLY
Linus Torvalds [Thu, 20 Feb 2025 16:51:57 +0000 (08:51 -0800)]
Merge tag 'bcachefs-2025-02-20' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Small stuff:
- The fsck code for Hongbo's directory i_size patch was wrong, caught
by transaction restart injection: we now have the CI running
another test variant with restart injection enabled
- Another fixup for reflink pointers to missing indirect extents:
previous fix was for fsck code, this fixes the normal runtime paths
- Another small srcu lock hold time fix, reported by jpsollie"
* tag 'bcachefs-2025-02-20' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix srcu lock warning in btree_update_nodes_written()
bcachefs: Fix bch2_indirect_extent_missing_error()
bcachefs: Fix fsck directory i_size checking
Linus Torvalds [Thu, 20 Feb 2025 16:48:55 +0000 (08:48 -0800)]
Merge tag 'xfs-fixes-6.14-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
"Just a collection of bug fixes, nothing really stands out"
* tag 'xfs-fixes-6.14-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: flush inodegc before swapon
xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activate
xfs: Do not allow norecovery mount with quotacheck
xfs: do not check NEEDSREPAIR if ro,norecovery mount.
xfs: fix data fork format filtering during inode repair
xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=n
Kan Liang [Wed, 19 Feb 2025 14:10:05 +0000 (06:10 -0800)]
perf/x86/intel: Fix event constraints for LNC
According to the latest event list, update the event constraint tables
for Lion Cove core.
The general rule (the event codes < 0x90 are restricted to counters
0-3.) has been removed. There is no restriction for most of the
performance monitoring events.
Fixes: a932aa0e868f ("perf/x86: Add Lunar Lake and Arrow Lake support") Reported-by: Amiri Khalil <amiri.khalil@intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20250219141005.2446823-1-kan.liang@linux.intel.com
Miklos Szeredi [Thu, 20 Feb 2025 10:02:58 +0000 (11:02 +0100)]
fuse: don't truncate cached, mutated symlink
Fuse allows the value of a symlink to change and this property is exploited
by some filesystems (e.g. CVMFS).
It has been observed, that sometimes after changing the symlink contents,
the value is truncated to the old size.
This is caused by fuse_getattr() racing with fuse_reverse_inval_inode().
fuse_reverse_inval_inode() updates the fuse_inode's attr_version, which
results in fuse_change_attributes() exiting before updating the cached
attributes
This is okay, as the cached attributes remain invalid and the next call to
fuse_change_attributes() will likely update the inode with the correct
values.
The reason this causes problems is that cached symlinks will be
returned through page_get_link(), which truncates the symlink to
inode->i_size. This is correct for filesystems that don't mutate
symlinks, but in this case it causes bad behavior.
The solution is to just remove this truncation. This can cause a
regression in a filesystem that relies on supplying a symlink larger than
the file size, but this is unlikely. If that happens we'd need to make
this behavior conditional.
Reported-by: Laura Promberger <laura.promberger@cern.ch> Tested-by: Sam Lewis <samclewis@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250220100258.793363-1-mszeredi@redhat.com Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
gpiolib: don't bail out if get_direction() fails in gpiochip_add_data()
Since commit 9d846b1aebbe ("gpiolib: check the return value of
gpio_chip::get_direction()") we check the return value of the
get_direction() callback as per its API contract. Some drivers have been
observed to fail to register now as they may call get_direction() in
gpiochip_add_data() in contexts where it has always silently failed.
Until we audit all drivers, replace the bail-out to a kernel log
warning.
Fixes: 9d846b1aebbe ("gpiolib: check the return value of gpio_chip::get_direction()") Reported-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/all/Z7VFB1nST6lbmBIo@finisterre.sirena.org.uk/ Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Closes: https://lore.kernel.org/all/dfe03f88-407e-4ef1-ad30-42db53bbd4e4@samsung.com/ Tested-by: Mark Brown <broonie@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250219144356.258635-1-brgl@bgdev.pl Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
====================
net: remove the single page frag cache for good
This is another attempt at reverting commit dbae2b062824 ("net: skb:
introduce and use a single page frag cache"), as it causes regressions
in specific use-cases.
Reverting such commit uncovers an allocation issue for build with
CONFIG_MAX_SKB_FRAGS=45, as reported by Sabrina.
This series handle the latter in patch 1 and brings the revert in patch
2.
Note that there is a little chicken-egg problem, as I included into the
patch 1's changelog the splat that would be visible only applying first
the revert: I think current patch order is better for bisectability,
still the splat is useful for correct attribution.
====================
Paolo Abeni [Tue, 18 Feb 2025 18:29:40 +0000 (19:29 +0100)]
Revert "net: skb: introduce and use a single page frag cache"
After the previous commit is finally safe to revert commit dbae2b062824
("net: skb: introduce and use a single page frag cache"): do it here.
The intended goal of such change was to counter a performance regression
introduced by commit 3226b158e67c ("net: avoid 32 x truesize
under-estimation for tiny skbs").
Unfortunately, the blamed commit introduces another regression for the
virtio_net driver. Such a driver calls napi_alloc_skb() with a tiny
size, so that the whole head frag could fit a 512-byte block.
The single page frag cache uses a 1K fragment for such allocation, and
the additional overhead, under small UDP packets flood, makes the page
allocator a bottleneck.
Thanks to commit bf9f1baa279f ("net: add dedicated kmem_cache for
typical/small skb->head"), this revert does not re-introduce the
original regression. Actually, in the relevant test on top of this
revert, I measure a small but noticeable positive delta, just above
noise level.
The revert itself required some additional mangling due to recent updates
in the affected code.
Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: dbae2b062824 ("net: skb: introduce and use a single page frag cache") Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
on kernel built with MAX_SKB_FRAGS=45, where SKB_WITH_OVERHEAD(1024)
is smaller than GRO_MAX_HEAD.
Such built additionally contains the revert of the single page frag cache
so that napi_get_frags() ends up using the page frag allocator, triggering
the splat.
Note that the underlying issue is independent from the mentioned
revert; address it ensuring that the small head cache will fit either TCP
and GRO allocation and updating napi_alloc_skb() and __netdev_alloc_skb()
to select kmalloc() usage for any allocation fitting such cache.
Reported-by: Sabrina Dubroca <sd@queasysnail.net> Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: 3948b05950fd ("net: introduce a config option to tweak MAX_SKB_FRAGS") Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Sabrina Dubroca [Mon, 17 Feb 2025 10:23:35 +0000 (11:23 +0100)]
tcp: drop secpath at the same time as we currently drop dst
Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
running tests that boil down to:
- create a pair of netns
- run a basic TCP test over ipcomp6
- delete the pair of netns
The xfrm_state found on spi_byaddr was not deleted at the time we
delete the netns, because we still have a reference on it. This
lingering reference comes from a secpath (which holds a ref on the
xfrm_state), which is still attached to an skb. This skb is not
leaked, it ends up on sk_receive_queue and then gets defer-free'd by
skb_attempt_defer_free.
The problem happens when we defer freeing an skb (push it on one CPU's
defer_list), and don't flush that list before the netns is deleted. In
that case, we still have a reference on the xfrm_state that we don't
expect at this point.
We already drop the skb's dst in the TCP receive path when it's no
longer needed, so let's also drop the secpath. At this point,
tcp_filter has already called into the LSM hooks that may require the
secpath, so it should not be needed anymore. However, in some of those
places, the MPTCP extension has just been attached to the skb, so we
cannot simply drop all extensions.
Nick Hu [Mon, 17 Feb 2025 05:58:42 +0000 (13:58 +0800)]
net: axienet: Set mac_managed_pm
The external PHY will undergo a soft reset twice during the resume process
when it wake up from suspend. The first reset occurs when the axienet
driver calls phylink_of_phy_connect(), and the second occurs when
mdio_bus_phy_resume() invokes phy_init_hw(). The second soft reset of the
external PHY does not reinitialize the internal PHY, which causes issues
with the internal PHY, resulting in the PHY link being down. To prevent
this, setting the mac_managed_pm flag skips the mdio_bus_phy_resume()
function.
Fixes: a129b41fe0a8 ("Revert "net: phy: dp83867: perform soft reset and retain established link"") Signed-off-by: Nick Hu <nick.hu@sifive.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250217055843.19799-1-nick.hu@sifive.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
net: core: improvements to device lookup by hardware address.
The first patch adds a new dev_getbyhwaddr() helper function for
finding devices by hardware address when the rtnl lock is held. This
prevents PROVE_LOCKING warnings that occurred when rtnl lock was held
but the RCU read lock wasn't. The common address comparison logic is
extracted into dev_comp_addr() to avoid code duplication.
The second coverts arp_req_set_public() to the new helper.
====================
Breno Leitao [Tue, 18 Feb 2025 13:49:31 +0000 (05:49 -0800)]
arp: switch to dev_getbyhwaddr() in arp_req_set_public()
The arp_req_set_public() function is called with the rtnl lock held,
which provides enough synchronization protection. This makes the RCU
variant of dev_getbyhwaddr() unnecessary. Switch to using the simpler
dev_getbyhwaddr() function since we already have the required rtnl
locking.
This change helps maintain consistency in the networking code by using
the appropriate helper function for the existing locking context.
Since we're not holding the RCU read lock in arp_req_set_public()
existing code could trigger false positive locking warnings.
Fixes: 941666c2e3e0 ("net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()") Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-2-d3d6892db9e1@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Breno Leitao [Tue, 18 Feb 2025 13:49:30 +0000 (05:49 -0800)]
net: Add non-RCU dev_getbyhwaddr() helper
Add dedicated helper for finding devices by hardware address when
holding rtnl_lock, similar to existing dev_getbyhwaddr_rcu(). This prevents
PROVE_LOCKING warnings when rtnl_lock is held but RCU read lock is not.
Extract common address comparison logic into dev_addr_cmp().
The context about this change could be found in the following
discussion:
Yu-Chun Lin [Tue, 18 Feb 2025 08:12:16 +0000 (16:12 +0800)]
sctp: Fix undefined behavior in left shift operation
According to the C11 standard (ISO/IEC 9899:2011, 6.5.7):
"If E1 has a signed type and E1 x 2^E2 is not representable in the result
type, the behavior is undefined."
Shifting 1 << 31 causes signed integer overflow, which leads to undefined
behavior.
Fix this by explicitly using '1U << 31' to ensure the shift operates on
an unsigned type, avoiding undefined behavior.
====================
flow_dissector: Fix handling of mixed port and port-range keys
This patchset contains two fixes for flow_dissector handling of mixed
port and port-range keys, for both tc-flower case and bpf case. Each
of them also comes with a selftest.
====================
Cong Wang [Tue, 18 Feb 2025 04:32:09 +0000 (20:32 -0800)]
flow_dissector: Fix port range key handling in BPF conversion
Fix how port range keys are handled in __skb_flow_bpf_to_target() by:
- Separating PORTS and PORTS_RANGE key handling
- Using correct key_ports_range structure for range keys
- Properly initializing both key types independently
This ensures port range information is correctly stored in its dedicated
structure rather than incorrectly using the regular ports key structure.
Fixes: 59fb9b62fb6c ("flow_dissector: Fix to use new variables for port ranges in bpf hook") Reported-by: Qiang Zhang <dtzq01@gmail.com> Closes: https://lore.kernel.org/netdev/CAPx+-5uvFxkhkz4=j_Xuwkezjn9U6kzKTD5jz4tZ9msSJ0fOJA@mail.gmail.com/ Cc: Yoshiki Komachi <komachi.yoshiki@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Link: https://patch.msgid.link/20250218043210.732959-4-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cong Wang [Tue, 18 Feb 2025 04:32:08 +0000 (20:32 -0800)]
selftests/net/forwarding: Add a test case for tc-flower of mixed port and port-range
After this patch:
# ./tc_flower_port_range.sh
TEST: Port range matching - IPv4 UDP [ OK ]
TEST: Port range matching - IPv4 TCP [ OK ]
TEST: Port range matching - IPv6 UDP [ OK ]
TEST: Port range matching - IPv6 TCP [ OK ]
TEST: Port range matching - IPv4 UDP Drop [ OK ]
Cong Wang [Tue, 18 Feb 2025 04:32:07 +0000 (20:32 -0800)]
flow_dissector: Fix handling of mixed port and port-range keys
This patch fixes a bug in TC flower filter where rules combining a
specific destination port with a source port range weren't working
correctly.
The specific case was when users tried to configure rules like:
tc filter add dev ens38 ingress protocol ip flower ip_proto udp \
dst_port 5000 src_port 2000-3000 action drop
The root cause was in the flow dissector code. While both
FLOW_DISSECTOR_KEY_PORTS and FLOW_DISSECTOR_KEY_PORTS_RANGE flags
were being set correctly in the classifier, the __skb_flow_dissect_ports()
function was only populating one of them: whichever came first in
the enum check. This meant that when the code needed both a specific
port and a port range, one of them would be left as 0, causing the
filter to not match packets as expected.
Fix it by removing the either/or logic and instead checking and
populating both key types independently when they're in use.
Fixes: 8ffb055beae5 ("cls_flower: Fix the behavior using port ranges with hw-offload") Reported-by: Qiang Zhang <dtzq01@gmail.com> Closes: https://lore.kernel.org/netdev/CAPx+-5uvFxkhkz4=j_Xuwkezjn9U6kzKTD5jz4tZ9msSJ0fOJA@mail.gmail.com/ Cc: Yoshiki Komachi <komachi.yoshiki@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250218043210.732959-2-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>