25960 Commits

Author SHA1 Message Date
Enze Li
9a2791e748 mm/damon: unify address range representation with damon_addr_range
Currently, DAMON defines two identical structures for representing address
ranges: damon_system_ram_region and damon_addr_range.  Both structures
share the same semantic interpretation of a half-open interval [start,
end), where the start address is inclusive and the end address is
exclusive.

This duplication adds unnecessary redundancy and increases maintenance
overhead.  This patch replaces all uses of damon_system_ram_region with
the more generic damon_addr_range structure, ensuring a unified type
representation for address ranges within the DAMON subsystem.  The change
simplifies the codebase, improves readability, and avoids potential
inconsistencies in future modifications.

Link: https://lkml.kernel.org/r/20260129100845.281734-1-lienze@kylinos.cn
Signed-off-by: Enze Li <lienze@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:15 -08:00
Thorsten Blum
ad789a85b1 mm/cma: replace snprintf with strscpy in cma_new_area
Replace snprintf("%s", ...) with the faster and more direct strscpy().

Link: https://lkml.kernel.org/r/20260126174516.236968-1-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:15 -08:00
Yosry Ahmed
e2c3b6b21c mm: zswap: use SG list decompression APIs from zsmalloc
Use the new zs_obj_read_sg_*() APIs in zswap_decompress(), instead of
zs_obj_read_*() APIs returning a linear address.  The SG list is passed
directly to the crypto API, simplifying the logic and dropping the
workaround that copies highmem addresses to a buffer.  The crypto API
should internally linearize the SG list if needed.

This avoids the memcpy() in zsmalloc for objects spanning multiple pages,
although an equivalent operation will be done internally by acomp/scomp. 
However, in the future compression algorithms could support handling
discontiguous SG lists, completely eliminating the copying for spanning
objects.

Zsmalloc fills an SG list up to 2 entries in size, so change the input SG
list to fit 2 entries.

Update the incompressible entries path to use memcpy_from_sglist() to copy
the data to the folio.  Opportunistically set dlen to PAGE_SIZE in the
same code path (rather that at the top of the function) to make it
clearer.

Drop the goto in zswap_compress() as the code now is not simple enough for
an if-else statement instead.  Rename 'decomp_ret' to 'ret' and reuse it
to keep the intermediate return value of crypto_acomp_decompress() to keep
line lengths manageable.

No functional change intended.

Link: https://lkml.kernel.org/r/20260121013615.2906368-1-yosry.ahmed@linux.dev
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-06 15:47:14 -08:00
Linus Torvalds
3dc58c9ce1 Merge tag 'mm-hotfixes-stable-2026-02-06-12-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
 "A couple of late-breaking MM fixes. One against a new-in-this-cycle
  patch and the other addresses a locking issue which has been there for
  over a year"

* tag 'mm-hotfixes-stable-2026-02-06-12-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/memory-failure: reject unsupported non-folio compound page
  procfs: avoid fetching build ID while holding VMA lock
2026-02-06 13:07:47 -08:00
Linus Torvalds
5ca98c22b5 Merge tag 'slab-for-6.19-rc8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
 "A stable fix for memory allocation profiling tag not being cleared
  when aborting an allocation due to memcg charge failure (Hao Ge)"

* tag 'slab-for-6.19-rc8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: Add alloc_tagging_slab_free_hook for memcg_alloc_abort_single
2026-02-06 09:56:03 -08:00
Joerg Roedel
ad09563660 Merge branches 'fixes', 'arm/smmu/updates', 'intel/vt-d', 'amd/amd-vi' and 'core' into next 2026-02-06 11:10:40 +01:00
Hao Li
98e99fc4ad slub: let need_slab_obj_exts() return false if SLAB_NO_OBJ_EXT is set
SLAB_NO_OBJ_EXT is set for boot caches, but need_slab_obj_exts() doesn't
check this flag. We should return false unconditionally when
SLAB_NO_OBJ_EXT is set.

Signed-off-by: Hao Li <hao.li@linux.dev>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260205120709.425719-1-hao.li@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-06 10:39:36 +01:00
Hao Ge
e6c53ead2d mm/slab: Add alloc_tagging_slab_free_hook for memcg_alloc_abort_single
When CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled, the following warning
may be noticed:

[ 3959.023862] ------------[ cut here ]------------
[ 3959.023891] alloc_tag was not cleared (got tag for lib/xarray.c:378)
[ 3959.023947] WARNING: ./include/linux/alloc_tag.h:155 at alloc_tag_add+0x128/0x178, CPU#6: mkfs.ntfs/113998
[ 3959.023978] Modules linked in: dns_resolver tun brd overlay exfat btrfs blake2b libblake2b xor xor_neon raid6_pq loop sctp ip6_udp_tunnel udp_tunnel ext4 crc16 mbcache jbd2 rfkill sunrpc vfat fat sg fuse nfnetlink sr_mod virtio_gpu cdrom drm_client_lib virtio_dma_buf drm_shmem_helper drm_kms_helper ghash_ce drm sm4 backlight virtio_net net_failover virtio_scsi failover virtio_console virtio_blk virtio_mmio dm_mirror dm_region_hash dm_log dm_multipath dm_mod i2c_dev aes_neon_bs aes_ce_blk [last unloaded: hwpoison_inject]
[ 3959.024170] CPU: 6 UID: 0 PID: 113998 Comm: mkfs.ntfs Kdump: loaded Tainted: G        W           6.19.0-rc7+ #7 PREEMPT(voluntary)
[ 3959.024182] Tainted: [W]=WARN
[ 3959.024186] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
[ 3959.024192] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3959.024199] pc : alloc_tag_add+0x128/0x178
[ 3959.024207] lr : alloc_tag_add+0x128/0x178
[ 3959.024214] sp : ffff80008b696d60
[ 3959.024219] x29: ffff80008b696d60 x28: 0000000000000000 x27: 0000000000000240
[ 3959.024232] x26: 0000000000000000 x25: 0000000000000240 x24: ffff800085d17860
[ 3959.024245] x23: 0000000000402800 x22: ffff0000c0012dc0 x21: 00000000000002d0
[ 3959.024257] x20: ffff0000e6ef3318 x19: ffff800085ae0410 x18: 0000000000000000
[ 3959.024269] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 3959.024281] x14: 0000000000000000 x13: 0000000000000001 x12: ffff600064101293
[ 3959.024292] x11: 1fffe00064101292 x10: ffff600064101292 x9 : dfff800000000000
[ 3959.024305] x8 : 00009fff9befed6e x7 : ffff000320809493 x6 : 0000000000000001
[ 3959.024316] x5 : ffff000320809490 x4 : ffff600064101293 x3 : ffff800080691838
[ 3959.024328] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000d5bcd640
[ 3959.024340] Call trace:
[ 3959.024346]  alloc_tag_add+0x128/0x178 (P)
[ 3959.024355]  __alloc_tagging_slab_alloc_hook+0x11c/0x1a8
[ 3959.024362]  kmem_cache_alloc_lru_noprof+0x1b8/0x5e8
[ 3959.024369]  xas_alloc+0x304/0x4f0
[ 3959.024381]  xas_create+0x1e0/0x4a0
[ 3959.024388]  xas_store+0x68/0xda8
[ 3959.024395]  __filemap_add_folio+0x5b0/0xbd8
[ 3959.024409]  filemap_add_folio+0x16c/0x7e0
[ 3959.024416]  __filemap_get_folio_mpol+0x2dc/0x9e8
[ 3959.024424]  iomap_get_folio+0xfc/0x180
[ 3959.024435]  __iomap_get_folio+0x2f8/0x4b8
[ 3959.024441]  iomap_write_begin+0x198/0xc18
[ 3959.024448]  iomap_write_iter+0x2ec/0x8f8
[ 3959.024454]  iomap_file_buffered_write+0x19c/0x290
[ 3959.024461]  blkdev_write_iter+0x38c/0x978
[ 3959.024470]  vfs_write+0x4d4/0x928
[ 3959.024482]  ksys_write+0xfc/0x1f8
[ 3959.024489]  __arm64_sys_write+0x74/0xb0
[ 3959.024496]  invoke_syscall+0xd4/0x258
[ 3959.024507]  el0_svc_common.constprop.0+0xb4/0x240
[ 3959.024514]  do_el0_svc+0x48/0x68
[ 3959.024520]  el0_svc+0x40/0xf8
[ 3959.024526]  el0t_64_sync_handler+0xa0/0xe8
[ 3959.024533]  el0t_64_sync+0x1ac/0x1b0
[ 3959.024540] ---[ end trace 0000000000000000 ]---

When __memcg_slab_post_alloc_hook() fails, there are two different
free paths depending on whether size == 1 or size != 1. In the
kmem_cache_free_bulk() path, we do call alloc_tagging_slab_free_hook().
However, in memcg_alloc_abort_single() we don't, the above warning will be
triggered on the next allocation.

Therefore, add alloc_tagging_slab_free_hook() to the
memcg_alloc_abort_single() path.

Fixes: 9f9796b413 ("mm, slab: move memcg charging to post-alloc hook")
Cc: stable@vger.kernel.org
Suggested-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Hao Ge <hao.ge@linux.dev>
Reviewed-by: Hao Li <hao.li@linux.dev>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260204101401.202762-1-hao.ge@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-06 09:51:08 +01:00
Miaohe Lin
ae9fd76c11 mm/memory-failure: reject unsupported non-folio compound page
When !CONFIG_TRANSPARENT_HUGEPAGE, a non-folio compound page can appear in
a userspace mapping via either vm_insert_*() functions or
vm_operatios_struct->fault().  They are not folios, thus should not be
considered for folio operations like split.  To reject these pages, make
sure get_hwpoison_page() is always called as HWPoisonHandlable() will do
the right work.

[Some commit log borrowed from Zi Yan. Thanks.]

Link: https://lkml.kernel.org/r/20260205075328.523211-1-linmiaohe@huawei.com
Fixes: 689b898677 ("mm/memory-failure: improve large block size folio handling")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reported-by: 是参差 <shicenci@gmail.com>
Closes: https://lore.kernel.org/all/PS1PPF7E1D7501F1E4F4441E7ECD056DEADAB98A@PS1PPF7E1D7501F.apcprd02.prod.outlook.com/
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-05 14:10:00 -08:00
Linus Torvalds
b20624608f Merge tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "Five hotfixes.  Two are cc:stable, two are for MM.

  All are singletons - please see the changelogs for details"

* tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Documentation: document liveupdate cmdline parameter
  mm, shmem: prevent infinite loop on truncate race
  mailmap: update Alexander Mikhalitsyn's emails
  liveupdate: luo_file: do not clear serialized_data on unfreeze
  x86/kfence: fix booting on 32bit non-PAE systems
2026-02-04 16:04:00 -08:00
Claudio Imbrenda
728b0e21b4 KVM: S390: Remove PGSTE code from linux/s390 mm
Remove the PGSTE config option.
Remove all code from linux/s390 mm that involves PGSTEs.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2026-02-04 17:00:10 +01:00
Harry Yoo
2f35fee943 mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable caches
While SLAB_OBJ_EXT_IN_OBJ allows to reduce memory overhead to account
slab objects, it prevents slab merging because merging can change
the metadata layout.

As pointed out Vlastimil Babka, disabling merging solely for this memory
optimization may not be a net win, because disabling slab merging tends
to increase overall memory usage.

Restrict SLAB_OBJ_EXT_IN_OBJ to caches that are already unmergeable for
other reasons (e.g., those with constructors or SLAB_TYPESAFE_BY_RCU).

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260127103151.21883-3-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:36 +01:00
Harry Yoo
a77d6d3386 mm/slab: place slabobj_ext metadata in unused space within s->size
When a cache has high s->align value and s->object_size is not aligned
to it, each object ends up with some unused space because of alignment.
If this wasted space is big enough, we can use it to store the
slabobj_ext metadata instead of wasting it.

On my system, this happens with caches like kmem_cache, mm_struct, pid,
task_struct, sighand_cache, xfs_inode, and others.

To place the slabobj_ext metadata within each object, the existing
slab_obj_ext() logic can still be used by setting:

  - slab->obj_exts = slab_address(slab) + (slabobj_ext offset)
  - stride = s->size

slab_obj_ext() doesn't need know where the metadata is stored,
so this method works without adding extra overhead to slab_obj_ext().

A good example benefiting from this optimization is xfs_inode
(object_size: 992, align: 64). To measure memory savings, 2 millions of
files were created on XFS.

[ MEMCG=y, MEM_ALLOC_PROFILING=n ]

Before patch (creating ~2.64M directories on xfs):
  Slab:            5175976 kB
  SReclaimable:    3837524 kB
  SUnreclaim:      1338452 kB

After patch (creating ~2.64M directories on xfs):
  Slab:            5152912 kB
  SReclaimable:    3838568 kB
  SUnreclaim:      1314344 kB (-23.54 MiB)

Enjoy the memory savings!

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-10-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:36 +01:00
Harry Yoo
fab0694646 mm/slab: move [__]ksize and slab_ksize() to mm/slub.c
To access SLUB's internal implementation details beyond cache flags in
ksize(), move __ksize(), ksize(), and slab_ksize() to mm/slub.c.

[vbabka@suse.cz: also make __ksize() static and move its kerneldoc to
 ksize() ]

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-9-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:35 +01:00
Harry Yoo
70089d0188 mm/slab: save memory by allocating slabobj_ext array from leftover
The leftover space in a slab is always smaller than s->size, and
kmem caches for large objects that are not power-of-two sizes tend to have
a greater amount of leftover space per slab. In some cases, the leftover
space is larger than the size of the slabobj_ext array for the slab.

An excellent example of such a cache is ext4_inode_cache. On my system,
the object size is 1136, with a preferred order of 3, 28 objects per slab,
and 960 bytes of leftover space per slab.

Since the size of the slabobj_ext array is only 224 bytes (w/o mem
profiling) or 448 bytes (w/ mem profiling) per slab, the entire array
fits within the leftover space.

Allocate the slabobj_exts array from this unused space instead of using
kcalloc() when it is large enough. The array is allocated from unused
space only when creating new slabs, and it doesn't try to utilize unused
space if alloc_slab_obj_exts() is called after slab creation because
implementing lazy allocation involves more expensive synchronization.

The implementation and evaluation of lazy allocation from unused space
is left as future-work. As pointed by Vlastimil Babka [1], it could be
beneficial when a slab cache without SLAB_ACCOUNT can be created, and
some of the allocations from the cache use __GFP_ACCOUNT. For example,
xarray does that.

To avoid unnecessary overhead when MEMCG (with SLAB_ACCOUNT) and
MEM_ALLOC_PROFILING are not used for the cache, allocate the slabobj_ext
array only when either of them is enabled on slab allocation.

[ MEMCG=y, MEM_ALLOC_PROFILING=n ]

Before patch (creating ~2.64M directories on ext4):
  Slab:            4747880 kB
  SReclaimable:    4169652 kB
  SUnreclaim:       578228 kB

After patch (creating ~2.64M directories on ext4):
  Slab:            4724020 kB
  SReclaimable:    4169188 kB
  SUnreclaim:       554832 kB (-22.84 MiB)

Enjoy the memory savings!

Link: https://lore.kernel.org/linux-mm/48029aab-20ea-4d90-bfd1-255592b2018e@suse.cz [1]
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-8-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:35 +01:00
Harry Yoo
4b1530f89c mm/memcontrol,alloc_tag: handle slabobj_ext access under KASAN poison
In the near future, slabobj_ext may reside outside the allocated slab
object range within a slab, which could be reported as an out-of-bounds
access by KASAN.

As suggested by Andrey Konovalov [1], explicitly disable KASAN and KMSAN
checks when accessing slabobj_ext within slab allocator, memory profiling,
and memory cgroup code. While an alternative approach could be to unpoison
slabobj_ext, out-of-bounds accesses outside the slab allocator are
generally more common.

Move metadata_access_enable()/disable() helpers to mm/slab.h so that
it can be used outside mm/slub.c. However, as suggested by Suren
Baghdasaryan [2], instead of calling them directly from mm code (which is
more prone to errors), change users to access slabobj_ext via get/put
APIs:

  - Users should call get_slab_obj_exts() to access slabobj_metadata
    and call put_slab_obj_exts() when it's done.

  - From now on, accessing it outside the section covered by
    get_slab_obj_exts() ~ put_slab_obj_exts() is illegal.
    This ensures that accesses to slabobj_ext metadata won't be reported
    as access violations.

Call kasan_reset_tag() in slab_obj_ext() before returning the address to
prevent SW or HW tag-based KASAN from reporting false positives.

Suggested-by: Andrey Konovalov <andreyknvl@gmail.com>
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/linux-mm/CA+fCnZezoWn40BaS3cgmCeLwjT+5AndzcQLc=wH3BjMCu6_YCw@mail.gmail.com [1]
Link: https://lore.kernel.org/linux-mm/CAJuCfpG=Lb4WhYuPkSpdNO4Ehtjm1YcEEK0OM=3g9i=LxmpHSQ@mail.gmail.com [2]
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-7-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:35 +01:00
Harry Yoo
7a8e71bc61 mm/slab: use stride to access slabobj_ext
Use a configurable stride value when accessing slab object extension
metadata instead of assuming a fixed sizeof(struct slabobj_ext).

Store stride value in free bits of slab->counters field. This allows
for flexibility in cases where the extension is embedded within
slab objects.

Since these free bits exist only on 64-bit, any future optimizations
that need to change stride value cannot be enabled on 32-bit architectures.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-6-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:35 +01:00
Harry Yoo
52f1ca8a45 mm/slab: abstract slabobj_ext access via new slab_obj_ext() helper
Currently, the slab allocator assumes that slab->obj_exts is a pointer
to an array of struct slabobj_ext objects. However, to support storage
methods where struct slabobj_ext is embedded within objects, the slab
allocator should not make this assumption. Instead of directly
dereferencing the slabobj_exts array, abstract access to
struct slabobj_ext via helper functions.

Introduce a new API slabobj_ext metadata access:

  slab_obj_ext(slab, obj_exts, index) - returns the pointer to
  struct slabobj_ext element at the given index.

Directly dereferencing the return value of slab_obj_exts() is no longer
allowed. Instead, slab_obj_ext() must always be used to access
individual struct slabobj_ext objects.

Convert all users to use these APIs.
No functional changes intended.

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-5-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:35 +01:00
Harry Yoo
a13b68d79d mm/slab: allow specifying free pointer offset when using constructor
When a slab cache has a constructor, the free pointer is placed after the
object because certain fields must not be overwritten even after the
object is freed.

However, some fields that the constructor does not initialize can safely
be overwritten after free. Allow specifying the free pointer offset within
the object, reducing the overall object size when some fields can be reused
for the free pointer.

Adjust the document accordingly.

Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260113061845.159790-3-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:35 +01:00
Harry Yoo
b85f369b81 mm/slab: use unsigned long for orig_size to ensure proper metadata align
When both KASAN and SLAB_STORE_USER are enabled, accesses to
struct kasan_alloc_meta fields can be misaligned on 64-bit architectures.
This occurs because orig_size is currently defined as unsigned int,
which only guarantees 4-byte alignment. When struct kasan_alloc_meta is
placed after orig_size, it may end up at a 4-byte boundary rather than
the required 8-byte boundary on 64-bit systems.

Note that 64-bit architectures without HAVE_EFFICIENT_UNALIGNED_ACCESS
are assumed to require 64-bit accesses to be 64-bit aligned.
See HAVE_64BIT_ALIGNED_ACCESS and commit adab66b71a ("Revert:
"ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"") for more details.

Change orig_size from unsigned int to unsigned long to ensure proper
alignment for any subsequent metadata. This should not waste additional
memory because kmalloc objects are already aligned to at least
ARCH_KMALLOC_MINALIGN.

Closes: https://lore.kernel.org/all/aPrLF0OUK651M4dk@hyeyoo
Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 6edf2576a6 ("mm/slub: enable debugging memory wasting of kmalloc")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Closes: https://lore.kernel.org/all/aPrLF0OUK651M4dk@hyeyoo/
Link: https://patch.msgid.link/20260113061845.159790-2-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:35 +01:00
Hao Li
9346ee2b53 slub: clarify object field layout comments
The comments above check_pad_bytes() document the field layout of a
single object. Rewrite them to improve clarity and precision.

Also update an outdated comment in calculate_sizes().

Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Hao Li <hao.li@linux.dev>
Link: https://patch.msgid.link/20251229122415.192377-1-hao.li@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:35 +01:00
Harry Yoo
280ea9c315 mm/slab: avoid allocating slabobj_ext array from its own slab
When allocating slabobj_ext array in alloc_slab_obj_exts(), the array
can be allocated from the same slab we're allocating the array for.
This led to obj_exts_in_slab() incorrectly returning true [1],
although the array is not allocated from wasted space of the slab.

Vlastimil Babka observed that this problem should be fixed even when
ignoring its incompatibility with obj_exts_in_slab(), because it creates
slabs that are never freed as there is always at least one allocated
object.

To avoid this, use the next kmalloc size or large kmalloc when
the array can be allocated from the same cache we're allocating
the array for.

In case of random kmalloc caches, there are multiple kmalloc caches
for the same size and the cache is selected based on the caller address.
Because it is fragile to ensure the same caller address is passed to
kmalloc_slab(), kmalloc_noprof(), and kmalloc_node_noprof(), bump the
size to (s->object_size + 1) when the sizes are equal, instead of
directly comparing the kmem_cache pointers.

Note that this doesn't happen when memory allocation profiling is
disabled, as when the allocation of the array is triggered by memory
cgroup (KMALLOC_CGROUP), the array is allocated from KMALLOC_NORMAL.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202601231457.f7b31e09-lkp@intel.com [1]
Cc: stable@vger.kernel.org
Fixes: 4b87369646 ("mm/slab: add allocation accounting into slab allocation and free paths")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260126125714.88008-1-harry.yoo@oracle.com
Reviewed-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04 10:05:24 +01:00
Frederic Weisbecker
ce84ad5e99 sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime.
In order to synchronize against vmstat workqueue to make sure
that no asynchronous vmstat work is still pending or executing on a
newly made isolated CPU, the housekeeping susbsystem must flush the
vmstat workqueues.

This involves flushing the whole mm_percpu_wq workqueue, shared with
LRU drain, introducing here a welcome side effect.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-mm@kvack.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker
b7eb4edcc3 sched/isolation: Flush memcg workqueues on cpuset isolated partition change
The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In
order to synchronize against memcg workqueue to make sure that no
asynchronous draining is still pending or executing on a newly made
isolated CPU, the housekeeping susbsystem must flush the memcg
workqueues.

However the memcg workqueues can't be flushed easily since they are
queued to the main per-CPU workqueue pool.

Solve this with creating a memcg specific pool and provide and use the
appropriate flushing API.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-mm@kvack.org
2026-02-03 15:23:34 +01:00
Frederic Weisbecker
69e227e450 mm: vmstat: Prepare to protect against concurrent isolated cpuset change
The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at
runtime. In order to synchronize against vmstat workqueue to make sure
that no asynchronous vmstat work is pending or executing on a newly made
isolated CPU, target and queue a vmstat work under the same RCU read
side critical section.

Whenever housekeeping will update the HK_TYPE_DOMAIN cpumask, a vmstat
workqueue flush will also be issued in a further change to make sure
that no work remains pending after a CPU has been made isolated.

Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: linux-mm@kvack.org
2026-02-03 15:21:12 +01:00
Frederic Weisbecker
2d05068610 memcg: Prepare to protect against concurrent isolated cpuset change
The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at
runtime. In order to synchronize against memcg workqueue to make sure
that no asynchronous draining is pending or executing on a newly made
isolated CPU, target and queue a drain work under the same RCU critical
section.

Whenever housekeeping will update the HK_TYPE_DOMAIN cpumask, a memcg
workqueue flush will also be issued in a further change to make sure
that no work remains pending after a CPU has been made isolated.

Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
2026-02-03 15:19:38 +01:00
Kairui Song
2030dddf95 mm, shmem: prevent infinite loop on truncate race
When truncating a large swap entry, shmem_free_swap() returns 0 when the
entry's index doesn't match the given index due to lookup alignment.  The
failure fallback path checks if the entry crosses the end border and
aborts when it happens, so truncate won't erase an unexpected entry or
range.  But one scenario was ignored.

When `index` points to the middle of a large swap entry, and the large
swap entry doesn't go across the end border, find_get_entries() will
return that large swap entry as the first item in the batch with
`indices[0]` equal to `index`.  The entry's base index will be smaller
than `indices[0]`, so shmem_free_swap() will fail and return 0 due to the
"base < index" check.  The code will then call shmem_confirm_swap(), get
the order, check if it crosses the END boundary (which it doesn't), and
retry with the same index.

The next iteration will find the same entry again at the same index with
same indices, leading to an infinite loop.

Fix this by retrying with a round-down index, and abort if the index is
smaller than the truncate range.

Link: https://lkml.kernel.org/r/aXo6ltB5iqAKJzY8@KASONG-MC4
Fixes: 809bc86517 ("mm: shmem: support large folio swap out")
Fixes: 8a1968bd99 ("mm/shmem, swap: fix race of truncate and swap entry split")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reported-by: Chris Mason <clm@meta.com>
Closes: https://lore.kernel.org/linux-mm/20260128130336.727049-1-clm@meta.com/
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-02 18:43:55 -08:00
Christoph Hellwig
b244c89a70 readahead: push invalidate_lock out of page_cache_ra_unbounded
Require the invalidate_lock to be held over calls to
page_cache_ra_unbounded instead of acquiring it in this function.

This prepares for calling page_cache_ra_unbounded from ->readahead for
fsverity read-ahead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20260202060754.270269-3-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-02-02 12:38:13 -08:00
Andrew Morton
2eec08ff09 Merge branch 'mm-hotfixes-stable' into mm-nonmm-stable to pick up changes
required to merge "kho: use unsigned long for nr_pages".
2026-01-31 16:12:21 -08:00
Kairui Song
50c7f34c5c mm, swap: remove no longer needed _swap_info_get
There are now only two users of _swap_info_get after consolidating these
callers, folio_free_swap and swp_swapcount.

folio_free_swap already holds the folio lock, and the folio must be in the
swap cache, _swap_info_get is redundant.

For swp_swapcount, it should use get_swap_device instead.  get_swap_device
increases the device ref count, which is actually a bit safer.  The only
current use is smap walking, and the performance change here is tiny.

And after these changes, _swap_info_get is no longer used, so we can
safely remove it.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-19-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:58 -08:00
Kairui Song
d3852f9692 mm, swap: drop the SWAP_HAS_CACHE flag
Now, the swap cache is managed by the swap table.  All swap cache users
are checking the swap table directly to check the swap cache state. 
SWAP_HAS_CACHE is now just a temporary pin before the first increase from
0 to 1 of a slot's swap count (swap_dup_entries) after swap allocation
(folio_alloc_swap), or before the final free of slots pinned by folio in
swap cache (put_swap_folio).

Drop these two usages.  For the first dup, SWAP_HAS_CACHE pinning was hard
to kill because it used to have multiple meanings, more than just "a slot
is cached".  We have just simplified that and defined that the first dup
is always done with folio locked in swap cache (folio_dup_swap), so stop
checking the SWAP_HAS_CACHE bit and just check the swap cache (swap table)
directly, and add a WARN if a swap entry's count is being increased for
the first time while the folio is not in swap cache.

As for freeing, just let the swap cache free all swap entries of a folio
that have a swap count of zero directly upon folio removal.  We have also
just cleaned up batch freeing to check the swap cache usage using the swap
table: a slot with swap cache in the swap table will not be freed until
its cache is gone, and no SWAP_HAS_CACHE bit is involved anymore.  And
besides, the removal of a folio and freeing of the slots are being done in
the same critical section now, which should improve the performance.

After these two changes, SWAP_HAS_CACHE no longer has any users.  Swap
cache synchronization is also done by the swap table directly, so using
SWAP_HAS_CACHE to pin a slot before adding the cache is also no longer
needed.  Remove all related logic and helpers.  swap_map is now only used
for tracking the count, so all swap_map users can just read it directly,
ignoring the swap_count helper, which was previously used to filter out
the SWAP_HAS_CACHE bit.

The idea of dropping SWAP_HAS_CACHE and using the swap table directly was
initially from Chris's idea of merging all the metadata usage of all swaps
into one place.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-18-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:57 -08:00
Kairui Song
e1c5c6be3c mm, swap: clean up and improve swap entries freeing
There are a few problems with the current freeing of swap entries.

When freeing a set of swap entries directly (swap_put_entries_direct,
typically from zapping the page table), it scans the whole swap region
multiple times.  First, it scans the whole region to check if it can be
batch freed and if there is any cached folio.  Then do a batch free only
if the whole region's swap count equals 1.  And if any entry is cached,
even if only one, it will have to walk the whole region again to clean up
the cache.

And if any entry is not in a consistent status with other entries, it will
fall back to order 0 freeing.  For example, if only one of them is cached,
the batch free will fall back.

And the current batch freeing workflow relies on the swap map's
SWAP_HAS_CACHE bit for both continuous checking and batch freeing, which
isn't compatible with the swap table design.

Tidy this up, introduce a new cluster scoped helper for all swap entry
freeing job.  It will batch frees all continuous entries, and just start a
new batch if any inconsistent entry is found.  This may improve the batch
size when the clusters are fragmented.  This should also be more robust
with more sanity checks, and make it clear that a slot pinned by swap
cache will be cleared upon cache reclaim.

And the cache reclaim scan is also now limited to each cluster.  If a
cluster has any clean swap cache left after putting the swap count,
reclaim the cluster only instead of the whole region.

And since a folio's entries are always in the same cluster, putting swap
entries from a folio can also use the new helper directly.

This should be both an optimization and a cleanup, and the new helper is
adapted to the swap table.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-17-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:57 -08:00
Kairui Song
4984d746c8 mm, swap: check swap table directly for checking cache
Instead of looking at the swap map, check swap table directly to tell if a
swap slot is cached.  Prepares for the removal of SWAP_HAS_CACHE.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-16-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:57 -08:00
Kairui Song
270f095179 mm, swap: add folio to swap cache directly on allocation
The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation. 
SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion.  This
pinning usage here can be dropped by adding the folio to swap cache
directly on allocation.

All swap allocations are folio-based now (except for hibernation), so the
swap allocator can always take the folio as the parameter.  And now both
swap cache (swap table) and swap map are protected by the cluster lock,
scanning the map and inserting the folio can be done in the same critical
section.  This eliminates the time window that a slot is pinned by
SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock multiple
times.

This is both a cleanup and an optimization.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-15-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:57 -08:00
Kairui Song
3697615914 mm, swap: cleanup swap entry management workflow
The current swap entry allocation/freeing workflow has never had a clear
definition.  This makes it hard to debug or add new optimizations.

This commit introduces a proper definition of how swap entries would be
allocated and freed.  Now, most operations are folio based, so they will
never exceed one swap cluster, and we now have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with new added sanity checks.  Also making more optimization
possible.

Swap entry will be mostly freed and free with a folio bound.  The folio
lock will be useful for resolving many swap related races.

Now swap allocation (except hibernation) always starts with a folio in the
swap cache, and gets duped/freed protected by the folio lock:

- folio_alloc_swap() - The only allocation entry point now.
  Context: The folio must be locked.
  This allocates one or a set of continuous swap slots for a folio and
  binds them to the folio by adding the folio to the swap cache. The
  swap slots' swap count start with zero value.

- folio_dup_swap() - Increase the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This increases the ref count of swap entries allocated to a folio.
  Newly allocated swap slots' count has to be increased by this helper
  as the folio got unmapped (and swap entries got installed).

- folio_put_swap() - Decrease the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This decreases the ref count of swap entries allocated to a folio.
  Typically, swapin will decrease the swap count as the folio got
  installed back and the swap entry got uninstalled

  This won't remove the folio from the swap cache and free the
  slot. Lazy freeing of swap cache is helpful for reducing IO.
  There is already a folio_free_swap() for immediate cache reclaim.
  This part could be further optimized later.

The above locking constraints could be further relaxed when the swap table
is fully implemented.  Currently dup still needs the caller to lock the
swap entry container (e.g.  PTL), or a concurrent zap may underflow the
swap count.

Some swap users need to interact with swap count without involving folio
(e.g.  forking/zapping the page table or mapping truncate without swapin).
In such cases, the caller has to ensure there is no race condition on
whatever owns the swap count and use the below helpers:

- swap_put_entries_direct() - Decrease the swap count directly.
  Context: The caller must lock whatever is referencing the slots to
  avoid a race.

  Typically the page table zapping or shmem mapping truncate will need
  to free swap slots directly. If a slot is cached (has a folio bound),
  this will also try to release the swap cache.

- swap_dup_entry_direct() - Increase the swap count directly.
  Context: The caller must lock whatever is referencing the entries to
  avoid race, and the entries must already have a swap count > 1.

  Typically, forking will need to copy the page table and hence needs to
  increase the swap count of the entries in the table. The page table is
  locked while referencing the swap entries, so the entries all have a
  swap count > 1 and can't be freed.

Hibernation subsystem is a bit different, so two special wrappers are here:

- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
  helper.

All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.

By separating the workflows, it will be possible to bind folio more
tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary
pin.

This commit should not introduce any behavior change

[kasong@tencent.com: fix leak, per Chris Mason.  Remove WARN_ON, per Lai Yi]
  Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
[ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris]
  Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Kairui Song <ryncsn@gmail.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Chris Mason <clm@fb.com>
Cc: Chris Mason <clm@meta.com>
Cc: Lai Yi <yi1.lai@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:56 -08:00
Kairui Song
de85024b34 mm, swap: remove workaround for unsynchronized swap map cache state
Remove the "skip if exists" check from commit a65b0e7607 ("zswap: make
shrinking memcg-aware").  It was needed because there is a tiny time
window between setting the SWAP_HAS_CACHE bit and actually adding the
folio to the swap cache.  If a user is trying to add the folio into the
swap cache but another user was interrupted after setting SWAP_HAS_CACHE
but hasn't added the folio to the swap cache yet, it might lead to a
deadlock.

We have moved the bit setting to the same critical section as adding the
folio, so this is no longer needed.  Remove it and clean it up.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-13-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:56 -08:00
Kairui Song
2732acda82 mm, swap: use swap cache as the swap in synchronize layer
Current swap in synchronization mostly uses the swap_map's SWAP_HAS_CACHE
bit.  Whoever sets the bit first does the actual work to swap in a folio.

This has been causing many issues as it's just a poor implementation of a
bit lock.  Raced users have no idea what is pinning a slot, so it has to
loop with a schedule_timeout_uninterruptible(1), which is ugly and causes
long-tailing or other performance issues.  Besides, the abuse of
SWAP_HAS_CACHE has been causing many other troubles for synchronization or
maintenance.

This is the first step to remove this bit completely.

Now all swap in paths are using the swap cache, and both the swap cache
and swap map are protected by the cluster lock.  So we can just resolve
the swap synchronization with the swap cache layer directly using the
cluster lock and folio lock.  Whoever inserts a folio in the swap cache
first does the swap in work.  And because folios are locked during swap
operations, other raced swap operations will just wait on the folio lock.

The SWAP_HAS_CACHE will be removed in later commit.  For now, we still set
it for some remaining users.  But now we do the bit setting and swap cache
folio adding in the same critical section, after swap cache is ready.  No
one will have to spin on the SWAP_HAS_CACHE bit anymore.

This both simplifies the logic and should improve the performance,
eliminating issues like the one solved in commit 01626a1823 ("mm: avoid
unconditional one-tick sleep when swapcache_prepare fails"), or the
"skip_if_exists" from commit a65b0e7607 ("zswap: make shrinking
memcg-aware"), which will be removed very soon.

[kasong@tencent.com: fix cgroup v1 accounting issue]
 Link: https://lkml.kernel.org/r/CAMgjq7CGUnzOVG7uSaYjzw9wD7w2dSKOHprJfaEp4CcGLgE3iw@mail.gmail.com
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-12-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:56 -08:00
Kairui Song
78d6a12dd9 mm, swap: split locked entry duplicating into a standalone helper
No feature change, split the common logic into a stand alone helper to be
reused later.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-11-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:55 -08:00
Kairui Song
cda2504c51 mm, swap: consolidate cluster reclaim and usability check
Swap cluster cache reclaim requires releasing the lock, so the cluster may
become unusable after the reclaim.  To prepare for checking swap cache
using the swap table directly, consolidate the swap cluster reclaim and
the check logic.

We will want to avoid touching the cluster's data completely with the swap
table, to avoid RCU overhead here.  And by moving the cluster usable check
into the reclaim helper, it will also help avoid a redundant scan of the
slots if the cluster is no longer usable, and we will want to avoid
touching the cluster.

Also, adjust it very slightly while at it: always scan the whole region
during reclaim, don't skip slots covered by a reclaimed folio.  Because
the reclaim is lockless, it's possible that new cache lands at any time. 
And for allocation, we want all caches to be reclaimed to avoid
fragmentation.  Besides, if the scan offset is not aligned with the size
of the reclaimed folio, we might skip some existing cache and fail the
reclaim unexpectedly.

There should be no observable behavior change.  It might slightly improve
the fragmentation issue or performance.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-10-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:55 -08:00
Kairui Song
f7ad377a92 mm, swap: swap entry of a bad slot should not be considered as swapped out
When checking if a swap entry is swapped out, we simply check if the
bitwise result of the count value is larger than 0.  But SWAP_MAP_BAD will
also be considered as a swao count value larger than 0.

SWAP_MAP_BAD being considered as a count value larger than 0 is useful for
the swap allocator: they will be seen as a used slot, so the allocator
will skip them.  But for the swapped out check, this isn't correct.

There is currently no observable issue.  The swapped out check is only
useful for readahead and folio swapped-out status check.  For readahead,
the swap cache layer will abort upon checking and updating the swap map. 
For the folio swapped out status check, the swap allocator will never
allocate an entry of bad slots to folio, so that part is fine too.  The
worst that could happen now is redundant allocation/freeing of folios and
waste CPU time.

This also makes it easier to get rid of swap map checking and update
during folio insertion in the swap cache layer.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-9-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:55 -08:00
Nhat Pham
bc617c990e mm/shmem, swap: remove SWAP_MAP_SHMEM
The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b
("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry
belongs to shmem during swapoff.

However, swapoff has since been rewritten in the commit b56a2d8af9 ("mm:
rid swapoff of quadratic complexity").  Now having swap count ==
SWAP_MAP_SHMEM value is basically the same as having swap count == 1, and
swap_shmem_alloc() behaves analogously to swap_duplicate().  The only
difference of note is that swap_shmem_alloc() does not check for -ENOMEM
returned from __swap_duplicate(), but it is OK because shmem never
re-duplicates any swap entry it owns.  This will stil be safe if we use
(batched) swap_duplicate() instead.

This commit adds swap_duplicate_nr(), the batched variant of
swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the associated
swap_shmem_alloc() helper to simplify the state machine (both mentally and
in terms of actual code).  We will also have an extra state/special value
that can be repurposed (for swap entries that never gets re-duplicated).

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-8-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:55 -08:00
Kairui Song
c246d236b1 mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
Now the overhead of the swap cache is trivial to none, bypassing the swap
cache is no longer a good optimization.

We have removed the cache bypass swapin for anon memory, now do the same
for shmem.  Many helpers and functions can be dropped now.

The performance may slightly drop because of the co-existence and double
update of swap_map and swap table, and this problem will be improved very
soon in later commits by dropping the swap_map update partially:

Swapin of 24 GB file with tmpfs with
transparent_hugepage_tmpfs=within_size and ZRAM, 3 test runs on my
machine:

Before:  After this commit:  After this series:
5.99s    6.29s               6.08s

And later swap table phases will drop the swap_map completely to avoid
overhead and reduce memory usage.

Link: https://lkml.kernel.org/r/20251219195751.61328-1-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:54 -08:00
Kairui Song
4b34f1d82c mm, swap: free the swap cache after folio is mapped
Currently, we remove the folio from the swap cache and free the swap cache
before mapping the PTE.  To reduce repeated faults due to parallel swapins
of the same PTE, change it to remove the folio from the swap cache after
it is mapped.  So new faults from the swap PTE will be much more likely to
see the folio in the swap cache and wait on it.

This does not eliminate all swapin races: an ongoing swapin fault may
still see an empty swap cache.  That's harmless, as the PTE is changed
before the swap cache is cleared, so it will just return and not trigger
any repeated faults.  This does help to reduce the chance.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-6-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:54 -08:00
Kairui Song
6aeec9a1a3 mm, swap: simplify the code and reduce indention
Now swap cache is always used, multiple swap cache checks are no longer
useful, remove them and reduce the code indention.

No behavior change.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-5-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:54 -08:00
Kairui Song
ab08be8dc9 mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
Now SWP_SYNCHRONOUS_IO devices are also using swap cache.  One side effect
is that a folio may stay in swap cache for a longer time due to lazy
freeing (vm_swap_full()).  This can help save some CPU / IO if folios are
being swapped out very frequently right after swapin, hence improving the
performance.  But the long pinning of swap slots also increases the
fragmentation rate of the swap device significantly, and currently, all
in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also causes the
backing memory to be pinned, increasing the memory pressure.

So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices after
swapin finishes.  Swap cache has served its role as a synchronization
layer to prevent any parallel swap-in from wasting CPU or memory
allocation, and the redundant IO is not a major concern for
SWP_SYNCHRONOUS_IO devices.

Worth noting, without this patch, this series so far can provide a ~30%
performance gain for certain workloads like MySQL or kernel compilation,
but causes significant regression or OOM when under extreme global
pressure.  With this patch, we still have a nice performance gain for most
workloads, and without introducing any observable regressions.  This is a
hint that further optimization can be done based on the new unified swapin
with swap cache, but for now, just keep the behaviour consistent with
before.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-4-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:54 -08:00
Kairui Song
f1879e8a0c mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
Now the overhead of the swap cache is trivial.  Bypassing the swap cache
is no longer a valid optimization.  So unify the swapin path using the
swap cache.  This changes the swap in behavior in two observable ways.

Readahead is now always disabled for SWP_SYNCHRONOUS_IO devices, which is
a huge win for some workloads: We used to rely on `SWP_SYNCHRONOUS_IO &&
__swap_count(entry) == 1` as the indicator to bypass both the swap cache
and readahead, the swap count check made bypassing ineffective in many
cases, and it's not a good indicator.  The limitation existed because the
current swap design made it hard to decouple readahead bypassing and swap
cache bypassing.  We do want to always bypass readahead for
SWP_SYNCHRONOUS_IO devices, but bypassing swap cache at the same time will
cause repeated IO and memory overhead.  Now that swap cache bypassing is
gone, this swap count check can be dropped.

The second thing here is that this enabled large swapin for all swap
entries on SWP_SYNCHRONOUS_IO devices.  Previously, the large swap in is
also coupled with swap cache bypassing, and so the swap count checking
also makes large swapin less effective.  Now this is also improved.  We
will always have large swapin supported for all SWP_SYNCHRONOUS_IO cases.

And to catch potential issues with large swapin, especially with page
exclusiveness and swap cache, more debug sanity checks and comments are
added.  But overall, the code is simpler.  And new helper and routines
will be used by other components in later commits too.  And now it's
possible to rely on the swap cache layer for resolving synchronization
issues, which will also be done by a later commit.

Worth mentioning that for a large folio workload, this may cause more
serious thrashing.  This isn't a problem with this commit, but a generic
large folio issue.  For a 4K workload, this commit increases the
performance.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-3-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:53 -08:00
Kairui Song
84eedc747b mm, swap: split swap cache preparation loop into a standalone helper
To prepare for the removal of swap cache bypass swapin, introduce a new
helper that accepts an allocated and charged fresh folio, prepares the
folio, the swap map, and then adds the folio to the swap cache.

This doesn't change how swap cache works yet, we are still depending on
the SWAP_HAS_CACHE in the swap map for synchronization.  But all
synchronization hacks are now all in this single helper.

No feature change.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-2-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:53 -08:00
Kairui Song
d7cf0d54f2 mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio
Patch series "mm, swap: swap table phase II: unify swapin use", v5.

This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code
and special swap flag bits including SWAP_HAS_CACHE, along with many
historical issues.  The performance is about ~20% better for some
workloads, like Redis with persistence.  This also cleans up the code to
prepare for later phases, some patches are from a previously posted
series.

Swap cache bypassing and swap synchronization in general had many issues. 
Some are solved as workarounds, and some are still there [1].  To resolve
them in a clean way, one good solution is to always use swap cache as the
synchronization layer [2].  So we have to remove the swap cache bypass
swap-in path first.  It wasn't very doable due to performance issues, but
now combined with the swap table, removing the swap cache bypass path will
instead improve the performance, there is no reason to keep it.

Now we can rework the swap entry and cache synchronization following the
new design.  Swap cache synchronization was heavily relying on
SWAP_HAS_CACHE, which is the cause of many issues.  By dropping the usage
of special swap map bits and related workarounds, we get a cleaner code
base and prepare for merging the swap count into the swap table in the
next step.

And swap_map is now only used for swap count, so in the next phase,
swap_map can be merged into the swap table, which will clean up more
things and start to reduce the static memory usage.  Removal of
swap_cgroup_ctrl is also doable, but needs to be done after we also
simplify the allocation of swapin folios: always use the new
swap_cache_alloc_folio helper so the accounting will also be managed by
the swap layer by then.

Test results:

Redis / Valkey bench:
=====================

Testing on a ARM64 VM 1.5G memory:
Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 460475.84 RPS               311591.19 RPS
After:  451943.34 RPS (-1.9%)       371379.06 RPS (+19.2%)

Testing on a x86_64 VM with 4G memory (system components takes about 2G):
Server:
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 306044.38 RPS               102745.88 RPS
After:  309645.44 RPS (+1.2%)       125313.28 RPS (+22.0%)

The performance is a lot better when persistence is applied.  This should
apply to many other workloads that involve sharing memory and COW.  A
slight performance drop was observed for the ARM64 Redis test: We are
still using swap_map to track the swap count, which is causing redundant
cache and CPU overhead and is not very performance-friendly for some
arches.  This will be improved once we merge the swap map into the swap
table (as already demonstrated previously [3]).

vm-scabiity
===========
usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
simulated PMEM as swap), average result of 6 test run:

                           Before:         After:
System time:               282.22s         283.47s
Sum Throughput:            5677.35 MB/s    5688.78 MB/s
Single process Throughput: 176.41 MB/s     176.23 MB/s
Free latency:              518477.96 us    521488.06 us

Which is almost identical.

Build kernel test:
==================
Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM
with 4G RAM, under global pressure, avg of 32 test run:

                Before            After:
System time:    1379.91s          1364.22s (-0.11%)

Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM
with 4G RAM, under global pressure, avg of 32 test run:

                Before            After:
System time:    1822.52s          1803.33s (-0.11%)

Which is almost identical.

MySQL:
======
sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
--table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
512M memory cgroup, buffer pool set to 3G, 3 test run and 180s warm up).

Before: 318162.18 qps
After:  318512.01 qps (+0.01%)

In conclusion, the result is looking better or identical for most cases,
and it's especially better for workloads with swap count > 1 on SYNC_IO
devices, about ~20% gain in above test.  Next phases will start to merge
swap count into swap table and reduce memory usage.

One more gain here is that we now have better support for THP swapin. 
Previously, the THP swapin was bound with swap cache bypassing, which only
works for single-mapped folios.  Removing the bypassing path also enabled
THP swapin for all folios.  The THP swapin is still limited to SYNC_IO
devices, the limitation can be removed later.

This may cause more serious THP thrashing for certain workloads, but
that's not an issue caused by this series, it's a common THP issue we
should resolve separately.


This patch (of 19):

__read_swap_cache_async is widely used to allocate and ensure a folio is
in swapcache, or get the folio if a folio is already there.

It's not async, and it's not doing any read.  Rename it to better present
its usage, and prepare to be reworked as part of new swap cache APIs.

Also, add some comments for the function.  Worth noting that the
skip_if_exists argument is an long existing workaround that will be
dropped soon.

Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-0-8862a265a033@tencent.com
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-1-8862a265a033@tencent.com
Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:53 -08:00
Dennis Zhou
a4818a8beb percpu: add double free check to pcpu_free_area()
Percpu memory provides access via offsets into the percpu address space. 
Offsets are essentially fixed for the lifetime of a chunk and therefore
require all users be good samaritans.  If a user improperly handles the
lifetime of the percpu object, it can result in corruption in a couple of
ways:

 - immediate double free - breaks percpu metadata accounting
 - free after subsequent allocation
    - corruption due to multiple owner problem (either prior owner still
      writes or future allocation happens)
    - potential for oops if the percpu pages are reclaimed as the
      subsequent allocation isn't pinning the pages down
    - can lead to page->private pointers pointing to freed chunks

Sebastian noticed that if this happens, none of the memory debugging
facilities add additional information [1].

This patch aims to catch invalid free scenarios within valid chunks.  To
better guard free_percpu(), we can either add a magic number or some
tracking facility to the percpu subsystem in a separate patch.

The invalid free check in pcpu_free_area() validates that the allocation's
starting bit is set in both alloc_map and bound_map.  The alloc_map bit
test ensures the area is allocated while the bound_map bit test checks we
are freeing from the beginning of an allocation.  We choose not to check
the validity of the offset as that is encoded in page->private being a
valid chunk.

pcpu_stats_area_dealloc() is moved later to only be on the happy path so
stats are only updated on valid frees.

Link: https://lkml.kernel.org/r/20260123205535.35267-1-dennis@kernel.org
Link: https://lore.kernel.org/lkml/20260119074813.ecAFsGaT@linutronix.de/ [1]
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Chistoph Lameter <cl@linux.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:52 -08:00
Li Zhe
46ba5a0118 hugetlb: increase hugepage reservations when using node-specific "hugepages=" cmdline
Commit 3dfd02c900 ("hugetlb: increase number of reserving hugepages via
cmdline") raised the number of hugepages that can be reserved through the
boot-time "hugepages=" parameter for the non-node-specific case, but left
the node-specific form of the same parameter unchanged.

This patch extends the same optimization to node-specific reservations. 
When HugeTLB vmemmap optimization (HVO) is enabled and a node cannot
satisfy the requested hugepages, the code first releases ordinary
struct-page memory of hugepages obtained from the buddy allocator,
allowing their struct-page memory to be reclaimed and reused for
additional hugepage reservations on that node.

This is particularly beneficial for configurations that require identical,
large per-node hugepage reservations.  On a four-node, 384 GB x86 VM, the
patch raises the attainable 2 MiB hugepage reservation from under 374 GB
to more than 379 GB.

Link: https://lkml.kernel.org/r/20260122035002.79958-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-31 14:22:52 -08:00