linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-03-22 07:27:12 +08:00

Author	SHA1	Message	Date
Enze Li	9a2791e748	mm/damon: unify address range representation with damon_addr_range Currently, DAMON defines two identical structures for representing address ranges: damon_system_ram_region and damon_addr_range. Both structures share the same semantic interpretation of a half-open interval [start, end), where the start address is inclusive and the end address is exclusive. This duplication adds unnecessary redundancy and increases maintenance overhead. This patch replaces all uses of damon_system_ram_region with the more generic damon_addr_range structure, ensuring a unified type representation for address ranges within the DAMON subsystem. The change simplifies the codebase, improves readability, and avoids potential inconsistencies in future modifications. Link: https://lkml.kernel.org/r/20260129100845.281734-1-lienze@kylinos.cn Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-02-06 15:47:15 -08:00
Thorsten Blum	ad789a85b1	mm/cma: replace snprintf with strscpy in cma_new_area Replace snprintf("%s", ...) with the faster and more direct strscpy(). Link: https://lkml.kernel.org/r/20260126174516.236968-1-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-02-06 15:47:15 -08:00
Yosry Ahmed	e2c3b6b21c	mm: zswap: use SG list decompression APIs from zsmalloc Use the new zs_obj_read_sg_() APIs in zswap_decompress(), instead of zs_obj_read_() APIs returning a linear address. The SG list is passed directly to the crypto API, simplifying the logic and dropping the workaround that copies highmem addresses to a buffer. The crypto API should internally linearize the SG list if needed. This avoids the memcpy() in zsmalloc for objects spanning multiple pages, although an equivalent operation will be done internally by acomp/scomp. However, in the future compression algorithms could support handling discontiguous SG lists, completely eliminating the copying for spanning objects. Zsmalloc fills an SG list up to 2 entries in size, so change the input SG list to fit 2 entries. Update the incompressible entries path to use memcpy_from_sglist() to copy the data to the folio. Opportunistically set dlen to PAGE_SIZE in the same code path (rather that at the top of the function) to make it clearer. Drop the goto in zswap_compress() as the code now is not simple enough for an if-else statement instead. Rename 'decomp_ret' to 'ret' and reuse it to keep the intermediate return value of crypto_acomp_decompress() to keep line lengths manageable. No functional change intended. Link: https://lkml.kernel.org/r/20260121013615.2906368-1-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-02-06 15:47:14 -08:00
Linus Torvalds	3dc58c9ce1	Merge tag 'mm-hotfixes-stable-2026-02-06-12-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "A couple of late-breaking MM fixes. One against a new-in-this-cycle patch and the other addresses a locking issue which has been there for over a year" * tag 'mm-hotfixes-stable-2026-02-06-12-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/memory-failure: reject unsupported non-folio compound page procfs: avoid fetching build ID while holding VMA lock	2026-02-06 13:07:47 -08:00
Linus Torvalds	5ca98c22b5	Merge tag 'slab-for-6.19-rc8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: "A stable fix for memory allocation profiling tag not being cleared when aborting an allocation due to memcg charge failure (Hao Ge)" * tag 'slab-for-6.19-rc8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: Add alloc_tagging_slab_free_hook for memcg_alloc_abort_single	2026-02-06 09:56:03 -08:00
Joerg Roedel	ad09563660	Merge branches 'fixes', 'arm/smmu/updates', 'intel/vt-d', 'amd/amd-vi' and 'core' into next	2026-02-06 11:10:40 +01:00
Hao Li	98e99fc4ad	slub: let need_slab_obj_exts() return false if SLAB_NO_OBJ_EXT is set SLAB_NO_OBJ_EXT is set for boot caches, but need_slab_obj_exts() doesn't check this flag. We should return false unconditionally when SLAB_NO_OBJ_EXT is set. Signed-off-by: Hao Li <hao.li@linux.dev> Acked-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260205120709.425719-1-hao.li@linux.dev Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-06 10:39:36 +01:00
Hao Ge	e6c53ead2d	mm/slab: Add alloc_tagging_slab_free_hook for memcg_alloc_abort_single When CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled, the following warning may be noticed: [ 3959.023862] ------------[ cut here ]------------ [ 3959.023891] alloc_tag was not cleared (got tag for lib/xarray.c:378) [ 3959.023947] WARNING: ./include/linux/alloc_tag.h:155 at alloc_tag_add+0x128/0x178, CPU#6: mkfs.ntfs/113998 [ 3959.023978] Modules linked in: dns_resolver tun brd overlay exfat btrfs blake2b libblake2b xor xor_neon raid6_pq loop sctp ip6_udp_tunnel udp_tunnel ext4 crc16 mbcache jbd2 rfkill sunrpc vfat fat sg fuse nfnetlink sr_mod virtio_gpu cdrom drm_client_lib virtio_dma_buf drm_shmem_helper drm_kms_helper ghash_ce drm sm4 backlight virtio_net net_failover virtio_scsi failover virtio_console virtio_blk virtio_mmio dm_mirror dm_region_hash dm_log dm_multipath dm_mod i2c_dev aes_neon_bs aes_ce_blk [last unloaded: hwpoison_inject] [ 3959.024170] CPU: 6 UID: 0 PID: 113998 Comm: mkfs.ntfs Kdump: loaded Tainted: G W 6.19.0-rc7+ #7 PREEMPT(voluntary) [ 3959.024182] Tainted: [W]=WARN [ 3959.024186] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 [ 3959.024192] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 3959.024199] pc : alloc_tag_add+0x128/0x178 [ 3959.024207] lr : alloc_tag_add+0x128/0x178 [ 3959.024214] sp : ffff80008b696d60 [ 3959.024219] x29: ffff80008b696d60 x28: 0000000000000000 x27: 0000000000000240 [ 3959.024232] x26: 0000000000000000 x25: 0000000000000240 x24: ffff800085d17860 [ 3959.024245] x23: 0000000000402800 x22: ffff0000c0012dc0 x21: 00000000000002d0 [ 3959.024257] x20: ffff0000e6ef3318 x19: ffff800085ae0410 x18: 0000000000000000 [ 3959.024269] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 3959.024281] x14: 0000000000000000 x13: 0000000000000001 x12: ffff600064101293 [ 3959.024292] x11: 1fffe00064101292 x10: ffff600064101292 x9 : dfff800000000000 [ 3959.024305] x8 : 00009fff9befed6e x7 : ffff000320809493 x6 : 0000000000000001 [ 3959.024316] x5 : ffff000320809490 x4 : ffff600064101293 x3 : ffff800080691838 [ 3959.024328] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000d5bcd640 [ 3959.024340] Call trace: [ 3959.024346] alloc_tag_add+0x128/0x178 (P) [ 3959.024355] __alloc_tagging_slab_alloc_hook+0x11c/0x1a8 [ 3959.024362] kmem_cache_alloc_lru_noprof+0x1b8/0x5e8 [ 3959.024369] xas_alloc+0x304/0x4f0 [ 3959.024381] xas_create+0x1e0/0x4a0 [ 3959.024388] xas_store+0x68/0xda8 [ 3959.024395] __filemap_add_folio+0x5b0/0xbd8 [ 3959.024409] filemap_add_folio+0x16c/0x7e0 [ 3959.024416] __filemap_get_folio_mpol+0x2dc/0x9e8 [ 3959.024424] iomap_get_folio+0xfc/0x180 [ 3959.024435] __iomap_get_folio+0x2f8/0x4b8 [ 3959.024441] iomap_write_begin+0x198/0xc18 [ 3959.024448] iomap_write_iter+0x2ec/0x8f8 [ 3959.024454] iomap_file_buffered_write+0x19c/0x290 [ 3959.024461] blkdev_write_iter+0x38c/0x978 [ 3959.024470] vfs_write+0x4d4/0x928 [ 3959.024482] ksys_write+0xfc/0x1f8 [ 3959.024489] __arm64_sys_write+0x74/0xb0 [ 3959.024496] invoke_syscall+0xd4/0x258 [ 3959.024507] el0_svc_common.constprop.0+0xb4/0x240 [ 3959.024514] do_el0_svc+0x48/0x68 [ 3959.024520] el0_svc+0x40/0xf8 [ 3959.024526] el0t_64_sync_handler+0xa0/0xe8 [ 3959.024533] el0t_64_sync+0x1ac/0x1b0 [ 3959.024540] ---[ end trace 0000000000000000 ]--- When __memcg_slab_post_alloc_hook() fails, there are two different free paths depending on whether size == 1 or size != 1. In the kmem_cache_free_bulk() path, we do call alloc_tagging_slab_free_hook(). However, in memcg_alloc_abort_single() we don't, the above warning will be triggered on the next allocation. Therefore, add alloc_tagging_slab_free_hook() to the memcg_alloc_abort_single() path. Fixes: `9f9796b413` ("mm, slab: move memcg charging to post-alloc hook") Cc: stable@vger.kernel.org Suggested-by: Hao Li <hao.li@linux.dev> Signed-off-by: Hao Ge <hao.ge@linux.dev> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260204101401.202762-1-hao.ge@linux.dev Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-06 09:51:08 +01:00
Miaohe Lin	ae9fd76c11	mm/memory-failure: reject unsupported non-folio compound page When !CONFIG_TRANSPARENT_HUGEPAGE, a non-folio compound page can appear in a userspace mapping via either vm_insert_*() functions or vm_operatios_struct->fault(). They are not folios, thus should not be considered for folio operations like split. To reject these pages, make sure get_hwpoison_page() is always called as HWPoisonHandlable() will do the right work. [Some commit log borrowed from Zi Yan. Thanks.] Link: https://lkml.kernel.org/r/20260205075328.523211-1-linmiaohe@huawei.com Fixes: `689b898677` ("mm/memory-failure: improve large block size folio handling") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reported-by: 是参差 <shicenci@gmail.com> Closes: https://lore.kernel.org/all/PS1PPF7E1D7501F1E4F4441E7ECD056DEADAB98A@PS1PPF7E1D7501F.apcprd02.prod.outlook.com/ Reviewed-by: Zi Yan <ziy@nvidia.com> Tested-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jane Chu <jane.chu@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-02-05 14:10:00 -08:00
Linus Torvalds	b20624608f	Merge tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Five hotfixes. Two are cc:stable, two are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: Documentation: document liveupdate cmdline parameter mm, shmem: prevent infinite loop on truncate race mailmap: update Alexander Mikhalitsyn's emails liveupdate: luo_file: do not clear serialized_data on unfreeze x86/kfence: fix booting on 32bit non-PAE systems	2026-02-04 16:04:00 -08:00
Claudio Imbrenda	728b0e21b4	KVM: S390: Remove PGSTE code from linux/s390 mm Remove the PGSTE config option. Remove all code from linux/s390 mm that involves PGSTEs. Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>	2026-02-04 17:00:10 +01:00
Harry Yoo	2f35fee943	mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable caches While SLAB_OBJ_EXT_IN_OBJ allows to reduce memory overhead to account slab objects, it prevents slab merging because merging can change the metadata layout. As pointed out Vlastimil Babka, disabling merging solely for this memory optimization may not be a net win, because disabling slab merging tends to increase overall memory usage. Restrict SLAB_OBJ_EXT_IN_OBJ to caches that are already unmergeable for other reasons (e.g., those with constructors or SLAB_TYPESAFE_BY_RCU). Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260127103151.21883-3-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:36 +01:00
Harry Yoo	a77d6d3386	mm/slab: place slabobj_ext metadata in unused space within s->size When a cache has high s->align value and s->object_size is not aligned to it, each object ends up with some unused space because of alignment. If this wasted space is big enough, we can use it to store the slabobj_ext metadata instead of wasting it. On my system, this happens with caches like kmem_cache, mm_struct, pid, task_struct, sighand_cache, xfs_inode, and others. To place the slabobj_ext metadata within each object, the existing slab_obj_ext() logic can still be used by setting: - slab->obj_exts = slab_address(slab) + (slabobj_ext offset) - stride = s->size slab_obj_ext() doesn't need know where the metadata is stored, so this method works without adding extra overhead to slab_obj_ext(). A good example benefiting from this optimization is xfs_inode (object_size: 992, align: 64). To measure memory savings, 2 millions of files were created on XFS. [ MEMCG=y, MEM_ALLOC_PROFILING=n ] Before patch (creating ~2.64M directories on xfs): Slab: 5175976 kB SReclaimable: 3837524 kB SUnreclaim: 1338452 kB After patch (creating ~2.64M directories on xfs): Slab: 5152912 kB SReclaimable: 3838568 kB SUnreclaim: 1314344 kB (-23.54 MiB) Enjoy the memory savings! Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-10-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:36 +01:00
Harry Yoo	fab0694646	mm/slab: move [__]ksize and slab_ksize() to mm/slub.c To access SLUB's internal implementation details beyond cache flags in ksize(), move __ksize(), ksize(), and slab_ksize() to mm/slub.c. [vbabka@suse.cz: also make __ksize() static and move its kerneldoc to ksize() ] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-9-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	70089d0188	mm/slab: save memory by allocating slabobj_ext array from leftover The leftover space in a slab is always smaller than s->size, and kmem caches for large objects that are not power-of-two sizes tend to have a greater amount of leftover space per slab. In some cases, the leftover space is larger than the size of the slabobj_ext array for the slab. An excellent example of such a cache is ext4_inode_cache. On my system, the object size is 1136, with a preferred order of 3, 28 objects per slab, and 960 bytes of leftover space per slab. Since the size of the slabobj_ext array is only 224 bytes (w/o mem profiling) or 448 bytes (w/ mem profiling) per slab, the entire array fits within the leftover space. Allocate the slabobj_exts array from this unused space instead of using kcalloc() when it is large enough. The array is allocated from unused space only when creating new slabs, and it doesn't try to utilize unused space if alloc_slab_obj_exts() is called after slab creation because implementing lazy allocation involves more expensive synchronization. The implementation and evaluation of lazy allocation from unused space is left as future-work. As pointed by Vlastimil Babka [1], it could be beneficial when a slab cache without SLAB_ACCOUNT can be created, and some of the allocations from the cache use __GFP_ACCOUNT. For example, xarray does that. To avoid unnecessary overhead when MEMCG (with SLAB_ACCOUNT) and MEM_ALLOC_PROFILING are not used for the cache, allocate the slabobj_ext array only when either of them is enabled on slab allocation. [ MEMCG=y, MEM_ALLOC_PROFILING=n ] Before patch (creating ~2.64M directories on ext4): Slab: 4747880 kB SReclaimable: 4169652 kB SUnreclaim: 578228 kB After patch (creating ~2.64M directories on ext4): Slab: `4724020` kB SReclaimable: 4169188 kB SUnreclaim: 554832 kB (-22.84 MiB) Enjoy the memory savings! Link: https://lore.kernel.org/linux-mm/48029aab-20ea-4d90-bfd1-255592b2018e@suse.cz [1] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-8-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	4b1530f89c	mm/memcontrol,alloc_tag: handle slabobj_ext access under KASAN poison In the near future, slabobj_ext may reside outside the allocated slab object range within a slab, which could be reported as an out-of-bounds access by KASAN. As suggested by Andrey Konovalov [1], explicitly disable KASAN and KMSAN checks when accessing slabobj_ext within slab allocator, memory profiling, and memory cgroup code. While an alternative approach could be to unpoison slabobj_ext, out-of-bounds accesses outside the slab allocator are generally more common. Move metadata_access_enable()/disable() helpers to mm/slab.h so that it can be used outside mm/slub.c. However, as suggested by Suren Baghdasaryan [2], instead of calling them directly from mm code (which is more prone to errors), change users to access slabobj_ext via get/put APIs: - Users should call get_slab_obj_exts() to access slabobj_metadata and call put_slab_obj_exts() when it's done. - From now on, accessing it outside the section covered by get_slab_obj_exts() ~ put_slab_obj_exts() is illegal. This ensures that accesses to slabobj_ext metadata won't be reported as access violations. Call kasan_reset_tag() in slab_obj_ext() before returning the address to prevent SW or HW tag-based KASAN from reporting false positives. Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Suggested-by: Suren Baghdasaryan <surenb@google.com> Link: https://lore.kernel.org/linux-mm/CA+fCnZezoWn40BaS3cgmCeLwjT+5AndzcQLc=wH3BjMCu6_YCw@mail.gmail.com [1] Link: https://lore.kernel.org/linux-mm/CAJuCfpG=Lb4WhYuPkSpdNO4Ehtjm1YcEEK0OM=3g9i=LxmpHSQ@mail.gmail.com [2] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-7-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	7a8e71bc61	mm/slab: use stride to access slabobj_ext Use a configurable stride value when accessing slab object extension metadata instead of assuming a fixed sizeof(struct slabobj_ext). Store stride value in free bits of slab->counters field. This allows for flexibility in cases where the extension is embedded within slab objects. Since these free bits exist only on 64-bit, any future optimizations that need to change stride value cannot be enabled on 32-bit architectures. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-6-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	52f1ca8a45	mm/slab: abstract slabobj_ext access via new slab_obj_ext() helper Currently, the slab allocator assumes that slab->obj_exts is a pointer to an array of struct slabobj_ext objects. However, to support storage methods where struct slabobj_ext is embedded within objects, the slab allocator should not make this assumption. Instead of directly dereferencing the slabobj_exts array, abstract access to struct slabobj_ext via helper functions. Introduce a new API slabobj_ext metadata access: slab_obj_ext(slab, obj_exts, index) - returns the pointer to struct slabobj_ext element at the given index. Directly dereferencing the return value of slab_obj_exts() is no longer allowed. Instead, slab_obj_ext() must always be used to access individual struct slabobj_ext objects. Convert all users to use these APIs. No functional changes intended. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-5-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	a13b68d79d	mm/slab: allow specifying free pointer offset when using constructor When a slab cache has a constructor, the free pointer is placed after the object because certain fields must not be overwritten even after the object is freed. However, some fields that the constructor does not initialize can safely be overwritten after free. Allow specifying the free pointer offset within the object, reducing the overall object size when some fields can be reused for the free pointer. Adjust the document accordingly. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-3-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	b85f369b81	mm/slab: use unsigned long for orig_size to ensure proper metadata align When both KASAN and SLAB_STORE_USER are enabled, accesses to struct kasan_alloc_meta fields can be misaligned on 64-bit architectures. This occurs because orig_size is currently defined as unsigned int, which only guarantees 4-byte alignment. When struct kasan_alloc_meta is placed after orig_size, it may end up at a 4-byte boundary rather than the required 8-byte boundary on 64-bit systems. Note that 64-bit architectures without HAVE_EFFICIENT_UNALIGNED_ACCESS are assumed to require 64-bit accesses to be 64-bit aligned. See HAVE_64BIT_ALIGNED_ACCESS and commit `adab66b71a` ("Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"") for more details. Change orig_size from unsigned int to unsigned long to ensure proper alignment for any subsequent metadata. This should not waste additional memory because kmalloc objects are already aligned to at least ARCH_KMALLOC_MINALIGN. Closes: https://lore.kernel.org/all/aPrLF0OUK651M4dk@hyeyoo Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: stable@vger.kernel.org Fixes: `6edf2576a6` ("mm/slub: enable debugging memory wasting of kmalloc") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Closes: https://lore.kernel.org/all/aPrLF0OUK651M4dk@hyeyoo/ Link: https://patch.msgid.link/20260113061845.159790-2-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Hao Li	9346ee2b53	slub: clarify object field layout comments The comments above check_pad_bytes() document the field layout of a single object. Rewrite them to improve clarity and precision. Also update an outdated comment in calculate_sizes(). Suggested-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Hao Li <hao.li@linux.dev> Link: https://patch.msgid.link/20251229122415.192377-1-hao.li@linux.dev Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:35 +01:00
Harry Yoo	280ea9c315	mm/slab: avoid allocating slabobj_ext array from its own slab When allocating slabobj_ext array in alloc_slab_obj_exts(), the array can be allocated from the same slab we're allocating the array for. This led to obj_exts_in_slab() incorrectly returning true [1], although the array is not allocated from wasted space of the slab. Vlastimil Babka observed that this problem should be fixed even when ignoring its incompatibility with obj_exts_in_slab(), because it creates slabs that are never freed as there is always at least one allocated object. To avoid this, use the next kmalloc size or large kmalloc when the array can be allocated from the same cache we're allocating the array for. In case of random kmalloc caches, there are multiple kmalloc caches for the same size and the cache is selected based on the caller address. Because it is fragile to ensure the same caller address is passed to kmalloc_slab(), kmalloc_noprof(), and kmalloc_node_noprof(), bump the size to (s->object_size + 1) when the sizes are equal, instead of directly comparing the kmem_cache pointers. Note that this doesn't happen when memory allocation profiling is disabled, as when the allocation of the array is triggered by memory cgroup (KMALLOC_CGROUP), the array is allocated from KMALLOC_NORMAL. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202601231457.f7b31e09-lkp@intel.com [1] Cc: stable@vger.kernel.org Fixes: `4b87369646` ("mm/slab: add allocation accounting into slab allocation and free paths") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260126125714.88008-1-harry.yoo@oracle.com Reviewed-by: Hao Li <hao.li@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>	2026-02-04 10:05:24 +01:00
Frederic Weisbecker	ce84ad5e99	sched/isolation: Flush vmstat workqueues on cpuset isolated partition change The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In order to synchronize against vmstat workqueue to make sure that no asynchronous vmstat work is still pending or executing on a newly made isolated CPU, the housekeeping susbsystem must flush the vmstat workqueues. This involves flushing the whole mm_percpu_wq workqueue, shared with LRU drain, introducing here a welcome side effect. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: linux-mm@kvack.org	2026-02-03 15:23:34 +01:00
Frederic Weisbecker	b7eb4edcc3	sched/isolation: Flush memcg workqueues on cpuset isolated partition change The HK_TYPE_DOMAIN housekeeping cpumask is now modifiable at runtime. In order to synchronize against memcg workqueue to make sure that no asynchronous draining is still pending or executing on a newly made isolated CPU, the housekeeping susbsystem must flush the memcg workqueues. However the memcg workqueues can't be flushed easily since they are queued to the main per-CPU workqueue pool. Solve this with creating a memcg specific pool and provide and use the appropriate flushing API. Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org Cc: linux-mm@kvack.org	2026-02-03 15:23:34 +01:00
Frederic Weisbecker	69e227e450	mm: vmstat: Prepare to protect against concurrent isolated cpuset change The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at runtime. In order to synchronize against vmstat workqueue to make sure that no asynchronous vmstat work is pending or executing on a newly made isolated CPU, target and queue a vmstat work under the same RCU read side critical section. Whenever housekeeping will update the HK_TYPE_DOMAIN cpumask, a vmstat workqueue flush will also be issued in a further change to make sure that no work remains pending after a CPU has been made isolated. Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: linux-mm@kvack.org	2026-02-03 15:21:12 +01:00
Frederic Weisbecker	2d05068610	memcg: Prepare to protect against concurrent isolated cpuset change The HK_TYPE_DOMAIN housekeeping cpumask will soon be made modifiable at runtime. In order to synchronize against memcg workqueue to make sure that no asynchronous draining is pending or executing on a newly made isolated CPU, target and queue a drain work under the same RCU critical section. Whenever housekeeping will update the HK_TYPE_DOMAIN cpumask, a memcg workqueue flush will also be issued in a further change to make sure that no work remains pending after a CPU has been made isolated. Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com>	2026-02-03 15:19:38 +01:00
Kairui Song	2030dddf95	mm, shmem: prevent infinite loop on truncate race When truncating a large swap entry, shmem_free_swap() returns 0 when the entry's index doesn't match the given index due to lookup alignment. The failure fallback path checks if the entry crosses the end border and aborts when it happens, so truncate won't erase an unexpected entry or range. But one scenario was ignored. When `index` points to the middle of a large swap entry, and the large swap entry doesn't go across the end border, find_get_entries() will return that large swap entry as the first item in the batch with `indices[0]` equal to `index`. The entry's base index will be smaller than `indices[0]`, so shmem_free_swap() will fail and return 0 due to the "base < index" check. The code will then call shmem_confirm_swap(), get the order, check if it crosses the END boundary (which it doesn't), and retry with the same index. The next iteration will find the same entry again at the same index with same indices, leading to an infinite loop. Fix this by retrying with a round-down index, and abort if the index is smaller than the truncate range. Link: https://lkml.kernel.org/r/aXo6ltB5iqAKJzY8@KASONG-MC4 Fixes: `809bc86517` ("mm: shmem: support large folio swap out") Fixes: `8a1968bd99` ("mm/shmem, swap: fix race of truncate and swap entry split") Signed-off-by: Kairui Song <kasong@tencent.com> Reported-by: Chris Mason <clm@meta.com> Closes: https://lore.kernel.org/linux-mm/20260128130336.727049-1-clm@meta.com/ Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-02-02 18:43:55 -08:00
Christoph Hellwig	b244c89a70	readahead: push invalidate_lock out of page_cache_ra_unbounded Require the invalidate_lock to be held over calls to page_cache_ra_unbounded instead of acquiring it in this function. This prepares for calling page_cache_ra_unbounded from ->readahead for fsverity read-ahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20260202060754.270269-3-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2026-02-02 12:38:13 -08:00
Andrew Morton	2eec08ff09	Merge branch 'mm-hotfixes-stable' into mm-nonmm-stable to pick up changes required to merge "kho: use unsigned long for nr_pages".	2026-01-31 16:12:21 -08:00
Kairui Song	50c7f34c5c	mm, swap: remove no longer needed _swap_info_get There are now only two users of _swap_info_get after consolidating these callers, folio_free_swap and swp_swapcount. folio_free_swap already holds the folio lock, and the folio must be in the swap cache, _swap_info_get is redundant. For swp_swapcount, it should use get_swap_device instead. get_swap_device increases the device ref count, which is actually a bit safer. The only current use is smap walking, and the performance change here is tiny. And after these changes, _swap_info_get is no longer used, so we can safely remove it. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-19-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:58 -08:00
Kairui Song	d3852f9692	mm, swap: drop the SWAP_HAS_CACHE flag Now, the swap cache is managed by the swap table. All swap cache users are checking the swap table directly to check the swap cache state. SWAP_HAS_CACHE is now just a temporary pin before the first increase from 0 to 1 of a slot's swap count (swap_dup_entries) after swap allocation (folio_alloc_swap), or before the final free of slots pinned by folio in swap cache (put_swap_folio). Drop these two usages. For the first dup, SWAP_HAS_CACHE pinning was hard to kill because it used to have multiple meanings, more than just "a slot is cached". We have just simplified that and defined that the first dup is always done with folio locked in swap cache (folio_dup_swap), so stop checking the SWAP_HAS_CACHE bit and just check the swap cache (swap table) directly, and add a WARN if a swap entry's count is being increased for the first time while the folio is not in swap cache. As for freeing, just let the swap cache free all swap entries of a folio that have a swap count of zero directly upon folio removal. We have also just cleaned up batch freeing to check the swap cache usage using the swap table: a slot with swap cache in the swap table will not be freed until its cache is gone, and no SWAP_HAS_CACHE bit is involved anymore. And besides, the removal of a folio and freeing of the slots are being done in the same critical section now, which should improve the performance. After these two changes, SWAP_HAS_CACHE no longer has any users. Swap cache synchronization is also done by the swap table directly, so using SWAP_HAS_CACHE to pin a slot before adding the cache is also no longer needed. Remove all related logic and helpers. swap_map is now only used for tracking the count, so all swap_map users can just read it directly, ignoring the swap_count helper, which was previously used to filter out the SWAP_HAS_CACHE bit. The idea of dropping SWAP_HAS_CACHE and using the swap table directly was initially from Chris's idea of merging all the metadata usage of all swaps into one place. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-18-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:57 -08:00
Kairui Song	e1c5c6be3c	mm, swap: clean up and improve swap entries freeing There are a few problems with the current freeing of swap entries. When freeing a set of swap entries directly (swap_put_entries_direct, typically from zapping the page table), it scans the whole swap region multiple times. First, it scans the whole region to check if it can be batch freed and if there is any cached folio. Then do a batch free only if the whole region's swap count equals 1. And if any entry is cached, even if only one, it will have to walk the whole region again to clean up the cache. And if any entry is not in a consistent status with other entries, it will fall back to order 0 freeing. For example, if only one of them is cached, the batch free will fall back. And the current batch freeing workflow relies on the swap map's SWAP_HAS_CACHE bit for both continuous checking and batch freeing, which isn't compatible with the swap table design. Tidy this up, introduce a new cluster scoped helper for all swap entry freeing job. It will batch frees all continuous entries, and just start a new batch if any inconsistent entry is found. This may improve the batch size when the clusters are fragmented. This should also be more robust with more sanity checks, and make it clear that a slot pinned by swap cache will be cleared upon cache reclaim. And the cache reclaim scan is also now limited to each cluster. If a cluster has any clean swap cache left after putting the swap count, reclaim the cluster only instead of the whole region. And since a folio's entries are always in the same cluster, putting swap entries from a folio can also use the new helper directly. This should be both an optimization and a cleanup, and the new helper is adapted to the swap table. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-17-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:57 -08:00
Kairui Song	4984d746c8	mm, swap: check swap table directly for checking cache Instead of looking at the swap map, check swap table directly to tell if a swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-16-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:57 -08:00
Kairui Song	270f095179	mm, swap: add folio to swap cache directly on allocation The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation. SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion. This pinning usage here can be dropped by adding the folio to swap cache directly on allocation. All swap allocations are folio-based now (except for hibernation), so the swap allocator can always take the folio as the parameter. And now both swap cache (swap table) and swap map are protected by the cluster lock, scanning the map and inserting the folio can be done in the same critical section. This eliminates the time window that a slot is pinned by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock multiple times. This is both a cleanup and an optimization. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-15-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:57 -08:00
Kairui Song	3697615914	mm, swap: cleanup swap entry management workflow The current swap entry allocation/freeing workflow has never had a clear definition. This makes it hard to debug or add new optimizations. This commit introduces a proper definition of how swap entries would be allocated and freed. Now, most operations are folio based, so they will never exceed one swap cluster, and we now have a cleaner border between swap and the rest of mm, making it much easier to follow and debug, especially with new added sanity checks. Also making more optimization possible. Swap entry will be mostly freed and free with a folio bound. The folio lock will be useful for resolving many swap related races. Now swap allocation (except hibernation) always starts with a folio in the swap cache, and gets duped/freed protected by the folio lock: - folio_alloc_swap() - The only allocation entry point now. Context: The folio must be locked. This allocates one or a set of continuous swap slots for a folio and binds them to the folio by adding the folio to the swap cache. The swap slots' swap count start with zero value. - folio_dup_swap() - Increase the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This increases the ref count of swap entries allocated to a folio. Newly allocated swap slots' count has to be increased by this helper as the folio got unmapped (and swap entries got installed). - folio_put_swap() - Decrease the swap count of one or more entries. Context: The folio must be locked and in the swap cache. For now, the caller still has to lock the new swap entry owner (e.g., PTL). This decreases the ref count of swap entries allocated to a folio. Typically, swapin will decrease the swap count as the folio got installed back and the swap entry got uninstalled This won't remove the folio from the swap cache and free the slot. Lazy freeing of swap cache is helpful for reducing IO. There is already a folio_free_swap() for immediate cache reclaim. This part could be further optimized later. The above locking constraints could be further relaxed when the swap table is fully implemented. Currently dup still needs the caller to lock the swap entry container (e.g. PTL), or a concurrent zap may underflow the swap count. Some swap users need to interact with swap count without involving folio (e.g. forking/zapping the page table or mapping truncate without swapin). In such cases, the caller has to ensure there is no race condition on whatever owns the swap count and use the below helpers: - swap_put_entries_direct() - Decrease the swap count directly. Context: The caller must lock whatever is referencing the slots to avoid a race. Typically the page table zapping or shmem mapping truncate will need to free swap slots directly. If a slot is cached (has a folio bound), this will also try to release the swap cache. - swap_dup_entry_direct() - Increase the swap count directly. Context: The caller must lock whatever is referencing the entries to avoid race, and the entries must already have a swap count > 1. Typically, forking will need to copy the page table and hence needs to increase the swap count of the entries in the table. The page table is locked while referencing the swap entries, so the entries all have a swap count > 1 and can't be freed. Hibernation subsystem is a bit different, so two special wrappers are here: - swap_alloc_hibernation_slot() - Allocate one entry from one device. - swap_free_hibernation_slot() - Free one entry allocated by the above helper. All hibernation entries are exclusive to the hibernation subsystem and should not interact with ordinary swap routines. By separating the workflows, it will be possible to bind folio more tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary pin. This commit should not introduce any behavior change [kasong@tencent.com: fix leak, per Chris Mason. Remove WARN_ON, per Lai Yi] Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com [ryncsn@gmail.com: fix KSM copy pages for swapoff, per Chris] Link: https://lkml.kernel.org/r/aXxkANcET3l2Xu6J@KASONG-MC4 Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-14-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Kairui Song <ryncsn@gmail.com> Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Chris Mason <clm@meta.com> Cc: Lai Yi <yi1.lai@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:56 -08:00
Kairui Song	de85024b34	mm, swap: remove workaround for unsynchronized swap map cache state Remove the "skip if exists" check from commit `a65b0e7607` ("zswap: make shrinking memcg-aware"). It was needed because there is a tiny time window between setting the SWAP_HAS_CACHE bit and actually adding the folio to the swap cache. If a user is trying to add the folio into the swap cache but another user was interrupted after setting SWAP_HAS_CACHE but hasn't added the folio to the swap cache yet, it might lead to a deadlock. We have moved the bit setting to the same critical section as adding the folio, so this is no longer needed. Remove it and clean it up. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-13-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:56 -08:00
Kairui Song	2732acda82	mm, swap: use swap cache as the swap in synchronize layer Current swap in synchronization mostly uses the swap_map's SWAP_HAS_CACHE bit. Whoever sets the bit first does the actual work to swap in a folio. This has been causing many issues as it's just a poor implementation of a bit lock. Raced users have no idea what is pinning a slot, so it has to loop with a schedule_timeout_uninterruptible(1), which is ugly and causes long-tailing or other performance issues. Besides, the abuse of SWAP_HAS_CACHE has been causing many other troubles for synchronization or maintenance. This is the first step to remove this bit completely. Now all swap in paths are using the swap cache, and both the swap cache and swap map are protected by the cluster lock. So we can just resolve the swap synchronization with the swap cache layer directly using the cluster lock and folio lock. Whoever inserts a folio in the swap cache first does the swap in work. And because folios are locked during swap operations, other raced swap operations will just wait on the folio lock. The SWAP_HAS_CACHE will be removed in later commit. For now, we still set it for some remaining users. But now we do the bit setting and swap cache folio adding in the same critical section, after swap cache is ready. No one will have to spin on the SWAP_HAS_CACHE bit anymore. This both simplifies the logic and should improve the performance, eliminating issues like the one solved in commit `01626a1823` ("mm: avoid unconditional one-tick sleep when swapcache_prepare fails"), or the "skip_if_exists" from commit `a65b0e7607` ("zswap: make shrinking memcg-aware"), which will be removed very soon. [kasong@tencent.com: fix cgroup v1 accounting issue] Link: https://lkml.kernel.org/r/CAMgjq7CGUnzOVG7uSaYjzw9wD7w2dSKOHprJfaEp4CcGLgE3iw@mail.gmail.com Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-12-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:56 -08:00
Kairui Song	78d6a12dd9	mm, swap: split locked entry duplicating into a standalone helper No feature change, split the common logic into a stand alone helper to be reused later. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-11-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:55 -08:00
Kairui Song	cda2504c51	mm, swap: consolidate cluster reclaim and usability check Swap cluster cache reclaim requires releasing the lock, so the cluster may become unusable after the reclaim. To prepare for checking swap cache using the swap table directly, consolidate the swap cluster reclaim and the check logic. We will want to avoid touching the cluster's data completely with the swap table, to avoid RCU overhead here. And by moving the cluster usable check into the reclaim helper, it will also help avoid a redundant scan of the slots if the cluster is no longer usable, and we will want to avoid touching the cluster. Also, adjust it very slightly while at it: always scan the whole region during reclaim, don't skip slots covered by a reclaimed folio. Because the reclaim is lockless, it's possible that new cache lands at any time. And for allocation, we want all caches to be reclaimed to avoid fragmentation. Besides, if the scan offset is not aligned with the size of the reclaimed folio, we might skip some existing cache and fail the reclaim unexpectedly. There should be no observable behavior change. It might slightly improve the fragmentation issue or performance. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-10-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:55 -08:00
Kairui Song	f7ad377a92	mm, swap: swap entry of a bad slot should not be considered as swapped out When checking if a swap entry is swapped out, we simply check if the bitwise result of the count value is larger than 0. But SWAP_MAP_BAD will also be considered as a swao count value larger than 0. SWAP_MAP_BAD being considered as a count value larger than 0 is useful for the swap allocator: they will be seen as a used slot, so the allocator will skip them. But for the swapped out check, this isn't correct. There is currently no observable issue. The swapped out check is only useful for readahead and folio swapped-out status check. For readahead, the swap cache layer will abort upon checking and updating the swap map. For the folio swapped out status check, the swap allocator will never allocate an entry of bad slots to folio, so that part is fine too. The worst that could happen now is redundant allocation/freeing of folios and waste CPU time. This also makes it easier to get rid of swap map checking and update during folio insertion in the swap cache layer. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-9-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:55 -08:00
Nhat Pham	bc617c990e	mm/shmem, swap: remove SWAP_MAP_SHMEM The SWAP_MAP_SHMEM state was introduced in the commit `aaa468653b` ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry belongs to shmem during swapoff. However, swapoff has since been rewritten in the commit `b56a2d8af9` ("mm: rid swapoff of quadratic complexity"). Now having swap count == SWAP_MAP_SHMEM value is basically the same as having swap count == 1, and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only difference of note is that swap_shmem_alloc() does not check for -ENOMEM returned from __swap_duplicate(), but it is OK because shmem never re-duplicates any swap entry it owns. This will stil be safe if we use (batched) swap_duplicate() instead. This commit adds swap_duplicate_nr(), the batched variant of swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the associated swap_shmem_alloc() helper to simplify the state machine (both mentally and in terms of actual code). We will also have an extra state/special value that can be repurposed (for swap entries that never gets re-duplicated). Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-8-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:55 -08:00
Kairui Song	c246d236b1	mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Now the overhead of the swap cache is trivial to none, bypassing the swap cache is no longer a good optimization. We have removed the cache bypass swapin for anon memory, now do the same for shmem. Many helpers and functions can be dropped now. The performance may slightly drop because of the co-existence and double update of swap_map and swap table, and this problem will be improved very soon in later commits by dropping the swap_map update partially: Swapin of 24 GB file with tmpfs with transparent_hugepage_tmpfs=within_size and ZRAM, 3 test runs on my machine: Before: After this commit: After this series: 5.99s 6.29s 6.08s And later swap table phases will drop the swap_map completely to avoid overhead and reduce memory usage. Link: https://lkml.kernel.org/r/20251219195751.61328-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:54 -08:00
Kairui Song	4b34f1d82c	mm, swap: free the swap cache after folio is mapped Currently, we remove the folio from the swap cache and free the swap cache before mapping the PTE. To reduce repeated faults due to parallel swapins of the same PTE, change it to remove the folio from the swap cache after it is mapped. So new faults from the swap PTE will be much more likely to see the folio in the swap cache and wait on it. This does not eliminate all swapin races: an ongoing swapin fault may still see an empty swap cache. That's harmless, as the PTE is changed before the swap cache is cleared, so it will just return and not trigger any repeated faults. This does help to reduce the chance. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-6-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:54 -08:00
Kairui Song	6aeec9a1a3	mm, swap: simplify the code and reduce indention Now swap cache is always used, multiple swap cache checks are no longer useful, remove them and reduce the code indention. No behavior change. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-5-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:54 -08:00
Kairui Song	ab08be8dc9	mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side effect is that a folio may stay in swap cache for a longer time due to lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios are being swapped out very frequently right after swapin, hence improving the performance. But the long pinning of swap slots also increases the fragmentation rate of the swap device significantly, and currently, all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also causes the backing memory to be pinned, increasing the memory pressure. So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices after swapin finishes. Swap cache has served its role as a synchronization layer to prevent any parallel swap-in from wasting CPU or memory allocation, and the redundant IO is not a major concern for SWP_SYNCHRONOUS_IO devices. Worth noting, without this patch, this series so far can provide a ~30% performance gain for certain workloads like MySQL or kernel compilation, but causes significant regression or OOM when under extreme global pressure. With this patch, we still have a nice performance gain for most workloads, and without introducing any observable regressions. This is a hint that further optimization can be done based on the new unified swapin with swap cache, but for now, just keep the behaviour consistent with before. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-4-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:54 -08:00
Kairui Song	f1879e8a0c	mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO Now the overhead of the swap cache is trivial. Bypassing the swap cache is no longer a valid optimization. So unify the swapin path using the swap cache. This changes the swap in behavior in two observable ways. Readahead is now always disabled for SWP_SYNCHRONOUS_IO devices, which is a huge win for some workloads: We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as the indicator to bypass both the swap cache and readahead, the swap count check made bypassing ineffective in many cases, and it's not a good indicator. The limitation existed because the current swap design made it hard to decouple readahead bypassing and swap cache bypassing. We do want to always bypass readahead for SWP_SYNCHRONOUS_IO devices, but bypassing swap cache at the same time will cause repeated IO and memory overhead. Now that swap cache bypassing is gone, this swap count check can be dropped. The second thing here is that this enabled large swapin for all swap entries on SWP_SYNCHRONOUS_IO devices. Previously, the large swap in is also coupled with swap cache bypassing, and so the swap count checking also makes large swapin less effective. Now this is also improved. We will always have large swapin supported for all SWP_SYNCHRONOUS_IO cases. And to catch potential issues with large swapin, especially with page exclusiveness and swap cache, more debug sanity checks and comments are added. But overall, the code is simpler. And new helper and routines will be used by other components in later commits too. And now it's possible to rely on the swap cache layer for resolving synchronization issues, which will also be done by a later commit. Worth mentioning that for a large folio workload, this may cause more serious thrashing. This isn't a problem with this commit, but a generic large folio issue. For a 4K workload, this commit increases the performance. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-3-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:53 -08:00
Kairui Song	84eedc747b	mm, swap: split swap cache preparation loop into a standalone helper To prepare for the removal of swap cache bypass swapin, introduce a new helper that accepts an allocated and charged fresh folio, prepares the folio, the swap map, and then adds the folio to the swap cache. This doesn't change how swap cache works yet, we are still depending on the SWAP_HAS_CACHE in the swap map for synchronization. But all synchronization hacks are now all in this single helper. No feature change. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-2-8862a265a033@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:53 -08:00
Kairui Song	d7cf0d54f2	mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio Patch series "mm, swap: swap table phase II: unify swapin use", v5. This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code and special swap flag bits including SWAP_HAS_CACHE, along with many historical issues. The performance is about ~20% better for some workloads, like Redis with persistence. This also cleans up the code to prepare for later phases, some patches are from a previously posted series. Swap cache bypassing and swap synchronization in general had many issues. Some are solved as workarounds, and some are still there [1]. To resolve them in a clean way, one good solution is to always use swap cache as the synchronization layer [2]. So we have to remove the swap cache bypass swap-in path first. It wasn't very doable due to performance issues, but now combined with the swap table, removing the swap cache bypass path will instead improve the performance, there is no reason to keep it. Now we can rework the swap entry and cache synchronization following the new design. Swap cache synchronization was heavily relying on SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage of special swap map bits and related workarounds, we get a cleaner code base and prepare for merging the swap count into the swap table in the next step. And swap_map is now only used for swap count, so in the next phase, swap_map can be merged into the swap table, which will clean up more things and start to reduce the static memory usage. Removal of swap_cgroup_ctrl is also doable, but needs to be done after we also simplify the allocation of swapin folios: always use the new swap_cache_alloc_folio helper so the accounting will also be managed by the swap layer by then. Test results: Redis / Valkey bench: ===================== Testing on a ARM64 VM 1.5G memory: Server: valkey-server --maxmemory 2560M Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get no persistence with BGSAVE Before: 460475.84 RPS 311591.19 RPS After: 451943.34 RPS (-1.9%) 371379.06 RPS (+19.2%) Testing on a x86_64 VM with 4G memory (system components takes about 2G): Server: Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get no persistence with BGSAVE Before: 306044.38 RPS 102745.88 RPS After: 309645.44 RPS (+1.2%) 125313.28 RPS (+22.0%) The performance is a lot better when persistence is applied. This should apply to many other workloads that involve sharing memory and COW. A slight performance drop was observed for the ARM64 Redis test: We are still using swap_map to track the swap count, which is causing redundant cache and CPU overhead and is not very performance-friendly for some arches. This will be improved once we merge the swap map into the swap table (as already demonstrated previously [3]). vm-scabiity =========== usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure, simulated PMEM as swap), average result of 6 test run: Before: After: System time: 282.22s 283.47s Sum Throughput: 5677.35 MB/s 5688.78 MB/s Single process Throughput: 176.41 MB/s 176.23 MB/s Free latency: 518477.96 us 521488.06 us Which is almost identical. Build kernel test: ================== Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM with 4G RAM, under global pressure, avg of 32 test run: Before After: System time: 1379.91s 1364.22s (-0.11%) Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM with 4G RAM, under global pressure, avg of 32 test run: Before After: System time: 1822.52s 1803.33s (-0.11%) Which is almost identical. MySQL: ====== sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16 --table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a 512M memory cgroup, buffer pool set to 3G, 3 test run and 180s warm up). Before: 318162.18 qps After: 318512.01 qps (+0.01%) In conclusion, the result is looking better or identical for most cases, and it's especially better for workloads with swap count > 1 on SYNC_IO devices, about ~20% gain in above test. Next phases will start to merge swap count into swap table and reduce memory usage. One more gain here is that we now have better support for THP swapin. Previously, the THP swapin was bound with swap cache bypassing, which only works for single-mapped folios. Removing the bypassing path also enabled THP swapin for all folios. The THP swapin is still limited to SYNC_IO devices, the limitation can be removed later. This may cause more serious THP thrashing for certain workloads, but that's not an issue caused by this series, it's a common THP issue we should resolve separately. This patch (of 19): __read_swap_cache_async is widely used to allocate and ensure a folio is in swapcache, or get the folio if a folio is already there. It's not async, and it's not doing any read. Rename it to better present its usage, and prepare to be reworked as part of new swap cache APIs. Also, add some comments for the function. Worth noting that the skip_if_exists argument is an long existing workaround that will be dropped soon. Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-0-8862a265a033@tencent.com Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-1-8862a265a033@tencent.com Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1] Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2] Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3] Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:53 -08:00
Dennis Zhou	a4818a8beb	percpu: add double free check to pcpu_free_area() Percpu memory provides access via offsets into the percpu address space. Offsets are essentially fixed for the lifetime of a chunk and therefore require all users be good samaritans. If a user improperly handles the lifetime of the percpu object, it can result in corruption in a couple of ways: - immediate double free - breaks percpu metadata accounting - free after subsequent allocation - corruption due to multiple owner problem (either prior owner still writes or future allocation happens) - potential for oops if the percpu pages are reclaimed as the subsequent allocation isn't pinning the pages down - can lead to page->private pointers pointing to freed chunks Sebastian noticed that if this happens, none of the memory debugging facilities add additional information [1]. This patch aims to catch invalid free scenarios within valid chunks. To better guard free_percpu(), we can either add a magic number or some tracking facility to the percpu subsystem in a separate patch. The invalid free check in pcpu_free_area() validates that the allocation's starting bit is set in both alloc_map and bound_map. The alloc_map bit test ensures the area is allocated while the bound_map bit test checks we are freeing from the beginning of an allocation. We choose not to check the validity of the offset as that is encoded in page->private being a valid chunk. pcpu_stats_area_dealloc() is moved later to only be on the happy path so stats are only updated on valid frees. Link: https://lkml.kernel.org/r/20260123205535.35267-1-dennis@kernel.org Link: https://lore.kernel.org/lkml/20260119074813.ecAFsGaT@linutronix.de/ [1] Signed-off-by: Dennis Zhou <dennis@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Chistoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:52 -08:00
Li Zhe	46ba5a0118	hugetlb: increase hugepage reservations when using node-specific "hugepages=" cmdline Commit `3dfd02c900` ("hugetlb: increase number of reserving hugepages via cmdline") raised the number of hugepages that can be reserved through the boot-time "hugepages=" parameter for the non-node-specific case, but left the node-specific form of the same parameter unchanged. This patch extends the same optimization to node-specific reservations. When HugeTLB vmemmap optimization (HVO) is enabled and a node cannot satisfy the requested hugepages, the code first releases ordinary struct-page memory of hugepages obtained from the buddy allocator, allowing their struct-page memory to be reclaimed and reused for additional hugepage reservations on that node. This is particularly beneficial for configurations that require identical, large per-node hugepage reservations. On a four-node, 384 GB x86 VM, the patch raises the attainable 2 MiB hugepage reservation from under 374 GB to more than 379 GB. Link: https://lkml.kernel.org/r/20260122035002.79958-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2026-01-31 14:22:52 -08:00

1 2 3 4 5 ...

25960 Commits