linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-04-17 03:39:07 +08:00

Author	SHA1	Message	Date
NeilBrown	64a56f635a	exportfs: remove locking around ->get_parent() call. The locking around the ->get_parent() call brings no value. We are locking a child which is only used to find an inode and thence the parent inode number. All further activity involves the parent inode which may have several children so locking one child cannot protect the parent in any useful way. The filesystem must already ensure that only one 'struct inode' exists for a given inode, and will call d_obtain_alias() which contains the required locking to ensure only one dentry will be attached to that inode. So remove the unnecessary locking. Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/r/174190497326.9342.9313518146512158587@noble.neil.brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-14 11:39:59 +01:00
Mateusz Guzik	dc530c44cd	fs: use debug-only asserts around fd allocation and install This also restores the check which got removed in `52732bb9ab` ("fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd()") for performance reasons -- they no longer apply with a debug-only variant. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250312161941.1261615-1-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-13 09:54:15 +01:00
Mateusz Guzik	86f40fa6a4	fs: dodge an atomic in putname if ref == 1 While the structure is refcounted, the only consumer incrementing it is audit and even then the atomic operation is only needed when it interacts with io_uring. If putname spots a count of 1, there is no legitimate way for anyone to bump it. If audit is disabled, the count is guaranteed to be 1, which consistently elides the atomic for all path lookups. If audit is enabled, it still manages to elide the last decrement. Note the patch does not do anything to prevent audit from suffering atomics. See [1] and [2] for a different approach. Benchmarked on Sapphire Rapids issuing access() (ops/s): before: 5106246 after: 5269678 (+3%) Link 1: https://lore.kernel.org/linux-fsdevel/20250307161155.760949-1-mjguzik@gmail.com/ Link 2: https://lore.kernel.org/linux-fsdevel/20250307164216.GI2023217@ZenIV/ Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250311181804.1165758-1-mjguzik@gmail.com Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-12 09:35:33 +01:00
Jan Kara	93fd0d46cb	vfs: Remove invalidate_inodes() The function can be replaced by evict_inodes. The only difference is that evict_inodes() skips the inodes with positive refcount without touching ->i_lock, but they are equivalent as evict_inodes() repeats the refcount check after having grabbed ->i_lock. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250307144318.28120-2-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-08 12:19:22 +01:00
Eric Sandeen	66447acc09	ecryptfs: remove NULL remount_fs from super_operations This got missed during the mount API conversion. This makes no functional difference, but after we convert the last filesystem, we'll want to remove the remount_fs op from super_operations altogether. So let's just get this out of the way now. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/5dc2eb45-7126-4777-a7f9-29d02dff443f@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-01 11:02:36 +01:00
Eric Sandeen	f13abc1e8e	watch_queue: fix pipe accounting mismatch Currently, watch_queue_set_size() modifies the pipe buffers charged to user->pipe_bufs without updating the pipe->nr_accounted on the pipe itself, due to the if (!pipe_has_watch_queue()) test in pipe_resize_ring(). This means that when the pipe is ultimately freed, we decrement user->pipe_bufs by something other than what than we had charged to it, potentially leading to an underflow. This in turn can cause subsequent too_many_pipe_buffers_soft() tests to fail with -EPERM. To remedy this, explicitly account for the pipe usage in watch_queue_set_size() to match the number set via account_pipe_buffers() (It's unclear why watch_queue_set_size() does not update nr_accounted; it may be due to intentional overprovisioning in watch_queue_set_size()?) Fixes: `e95aada4cb` ("pipe: wakeup wr_wait after setting max_usage") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/206682a8-0604-49e5-8224-fdbe0c12b460@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-01 11:02:35 +01:00
Pan Deng	e249056c91	fs: place f_ref to 3rd cache line in struct file to resolve false sharing When running syscall pread in a high core count system, f_ref contends with the reading of f_mode, f_op, f_mapping, f_inode, f_flags in the same cache line. This change places f_ref to the 3rd cache line where fields are not updated as frequently as the 1st cache line, and the contention is grealy reduced according to tests. In addition, the size of file object is kept in 3 cache lines. This change has been tested with rocksdb benchmark readwhilewriting case in 1 socket 64 physical core 128 logical core baremetal machine, with build config CONFIG_RANDSTRUCT_NONE=y Command: ./db_bench --benchmarks="readwhilewriting" --threads $cnt --duration 60 The throughput(ops/s) is improved up to ~21%. ===== thread baseline compare 16 100% +1.3% 32 100% +2.2% 64 100% +7.2% 128 100% +20.9% It was also tested with UnixBench: syscall, fsbuffer, fstime, fsdisk cases that has been used for file struct layout tuning, no regression was observed. Signed-off-by: Pan Deng <pan.deng@intel.com> Link: https://lore.kernel.org/r/20250228020059.3023375-1-pan.deng@intel.com Tested-by: Lipeng Zhu <lipeng.zhu@intel.com> Reviewed-by: Tianyou Li <tianyou.li@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-03-01 11:02:35 +01:00
Lin Feng	d3a194d95f	epoll: simplify ep_busy_loop by removing always 0 argument Commit 00b27634bc471("epoll: replace gotos with a proper loop") refactored ep_poll and always feeds ep_busy_loop with a time_out value of 0, nonblock mode for ep_busy_loop has sunk since, IOW nonblock mode checking has been taken over by ep_poll itself, codes snipped: static int ep_poll(struct eventpoll *ep... { ... if (timed_out) return 0; eavail = ep_busy_loop(ep, timed_out); ... } Given this fact codes can be simplified a bit for ep_busy_loop. Signed-off-by: Lin Feng <linf@wangsu.com> Link: https://lore.kernel.org/r/20250227033412.5873-1-linf@wangsu.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-27 09:40:56 +01:00
Matthew Wilcox (Oracle)	12851bd921	fs: Turn page_offset() into a wrapper around folio_pos() This is far less efficient for the lagging filesystems which still use page_offset(), but it removes an access to page->index. It also fixes a bug -- if any filesystem passed a tail page to page_offset(), it would return garbage which might result in the filesystem choosing to not writeback a dirty page. There probably aren't any examples of this, but I can't be certain. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250221203932.3588740-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-24 11:41:43 +01:00
Colin Ian King	d1c735d44c	kcmp: improve performance adding an unlikely hint to task comparisons Adding an unlikely() hint on task comparisons on an unlikely error return path improves run-time performance of the kcmp system call. Benchmarking on an i9-12900 shows an improvement of ~5.5% on kcmp(). Results based on running 20 tests with turbo disabled (to reduce clock freq turbo changes), with 10 second run per test and comparing the number of kcmp calls per second. The % Standard deviation of 20 tests was ~0.25%, results are reliable. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20250213163916.709392-1-colin.i.king@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:33 +01:00
Mateusz Guzik	1479be6258	vfs: inline new_inode_pseudo() and de-staticize alloc_inode() The former is a no-op wrapper with the same argument. I left it in place to not lose the information who needs it -- one day "pseudo" inodes may start differing from what alloc_inode() returns. In the meantime no point taking a detour. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250212180459.1022983-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:32 +01:00
Christian Brauner	da06e3c517	fs: don't needlessly acquire f_lock Before 2011 there was no meaningful synchronization between read/readdir/write/seek. Only in commit `ef3d0fd27e` ("vfs: do (nearly) lockless generic_file_llseek") synchronization was added for SEEK_CUR by taking f_lock around vfs_setpos(). Then in 2014 full synchronization between read/readdir/write/seek was added in commit `9c225f2655` ("vfs: atomic f_pos accesses as per POSIX") by introducing f_pos_lock for regular files with FMODE_ATOMIC_POS and for directories. At that point taking f_lock became unnecessary for such files. So only acquire f_lock for SEEK_CUR if this isn't a file that would have acquired f_pos_lock if necessary. Link: https://lore.kernel.org/r/20250207-daten-mahlzeit-99d2079864fb@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:32 +01:00
Mateusz Guzik	1bb772565f	vfs: inline getname() It is merely a trivial wrapper around getname_flags which adds a zeroed argument, no point paying for an extra call. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250206000105.432528-1-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:32 +01:00
Christian Brauner	8190db20c2	Merge patch series "Fix the return type of several functions from long to int" Yuichiro Tsuji <yuichtsu@amazon.com> says: These patches fix the return type of several functions from long to int to match its actual behavior. * patches from https://lore.kernel.org/r/20250121070844.4413-1-yuichtsu@amazon.com: ioctl: Fix return type of several functions from long to int open: Fix return type of several functions from long to int Link: https://lore.kernel.org/r/20250121070844.4413-1-yuichtsu@amazon.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:32 +01:00
Mateusz Guzik	d6ff4c8f65	fs: avoid mmap sem relocks when coredumping with many missing pages Dumping processes with large allocated and mostly not-faulted areas is very slow. Borrowing a test case from Tavian Barnes: int main(void) { char *mem = mmap(NULL, 1ULL << 40, PROT_READ \| PROT_WRITE, MAP_ANONYMOUS \| MAP_NORESERVE \| MAP_PRIVATE, -1, 0); printf("%p %m\n", mem); if (mem != MAP_FAILED) { mem[0] = 1; } abort(); } That's 1TB of almost completely not-populated area. On my test box it takes 13-14 seconds to dump. The profile shows: - 99.89% 0.00% a.out entry_SYSCALL_64_after_hwframe do_syscall_64 syscall_exit_to_user_mode arch_do_signal_or_restart - get_signal - 99.89% do_coredump - 99.88% elf_core_dump - dump_user_range - 98.12% get_dump_page - 64.19% __get_user_pages - 40.92% gup_vma_lookup - find_vma - mt_find 4.21% __rcu_read_lock 1.33% __rcu_read_unlock - 3.14% check_vma_flags 0.68% vma_is_secretmem 0.61% __cond_resched 0.60% vma_pgtable_walk_end 0.59% vma_pgtable_walk_begin 0.58% no_page_table - 15.13% down_read_killable 0.69% __cond_resched 13.84% up_read 0.58% __cond_resched Almost 29% of the time is spent relocking the mmap semaphore between calls to get_dump_page() which find nothing. Whacking that results in times of 10 seconds (down from 13-14). While here make the thing killable. The real problem is the page-sized iteration and the real fix would patch it up instead. It is left as an exercise for the mm-familiar reader. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250119103205.2172432-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:32 +01:00
Yuichiro Tsuji	f326565c44	ioctl: Fix return type of several functions from long to int Fix the return type of several functions from long to int to match its actu al behavior. These functions only return int values. This change improves type consistency across the filesystem code and aligns the function signatu re with its existing implementation and usage. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Yuichiro Tsuji <yuichtsu@amazon.com> Link: https://lore.kernel.org/r/20250121070844.4413-3-yuichtsu@amazon.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:32 +01:00
Yuichiro Tsuji	29d80d506b	open: Fix return type of several functions from long to int Fix the return type of several functions from long to int to match its actu al behavior. These functions only return int values. This change improves type consistency across the filesystem code and aligns the function signatu re with its existing implementation and usage. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Yuichiro Tsuji <yuichtsu@amazon.com> Link: https://lore.kernel.org/r/20250121070844.4413-2-yuichtsu@amazon.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:32 +01:00
Al Viro	f9835fa147	make use of anon_inode_getfile_fmode() ["fallen through the cracks" misc stuff] A bunch of anon_inode_getfile() callers follow it with adjusting ->f_mode; we have a helper doing that now, so let's make use of it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250118014434.GT1977892@ZenIV Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:25:31 +01:00
Christian Brauner	822c115925	Merge patch series "CONFIG_DEBUG_VFS at last" Mateusz Guzik <mjguzik@gmail.com> says: This adds a super basic version just to get the mechanism going, along with sample usage. The macro set is incomplete (e.g., lack of locking macros) and dump_inode routine fails to dump any state yet, to be implemented(tm). I think despite the primitive state this is complete enough to start sprinkling asserts as necessary. * patches from https://lore.kernel.org/r/20250209185523.745956-1-mjguzik@gmail.com: vfs: use the new debug macros in inode_set_cached_link() vfs: catch invalid modes in may_open() vfs: add initial support for CONFIG_DEBUG_VFS Link: https://lore.kernel.org/r/20250209185523.745956-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:23:54 +01:00
Mateusz Guzik	3eb7e95104	vfs: use the new debug macros in inode_set_cached_link() Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250209185523.745956-4-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:23:53 +01:00
Mateusz Guzik	af153bb63a	vfs: catch invalid modes in may_open() Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250209185523.745956-3-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:23:53 +01:00
Mateusz Guzik	8b17e54096	vfs: add initial support for CONFIG_DEBUG_VFS Small collection of macros taken from mmdebug.h Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250209185523.745956-2-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-21 10:23:53 +01:00
Linus Torvalds	2014c95afe	Linux 6.14-rc1 v6.14-rc1	2025-02-02 15:39:26 -08:00
Linus Torvalds	d79bc8f79b	Merge tag 'turbostat-2025.02.02' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux Pull turbostat updates from Len Brown: - Fix regression that affinitized forked child in one-shot mode. - Harden one-shot mode against hotplug online/offline - Enable RAPL SysWatt column by default - Add initial PTL, CWF platform support - Harden initial PMT code in response to early use - Enable first built-in PMT counter: CWF c1e residency - Refuse to run on unsupported platforms without --force, to encourage updating to a version that supports the system, and to avoid no-so-useful measurement results * tag 'turbostat-2025.02.02' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (25 commits) tools/power turbostat: version 2025.02.02 tools/power turbostat: Add CPU%c1e BIC for CWF tools/power turbostat: Harden one-shot mode against cpu offline tools/power turbostat: Fix forked child affinity regression tools/power turbostat: Add tcore clock PMT type tools/power turbostat: version 2025.01.14 tools/power turbostat: Allow adding PMT counters directly by sysfs path tools/power turbostat: Allow mapping multiple PMT files with the same GUID tools/power turbostat: Add PMT directory iterator helper tools/power turbostat: Extend PMT identification with a sequence number tools/power turbostat: Return default value for unmapped PMT domains tools/power turbostat: Check for non-zero value when MSR probing tools/power turbostat: Enhance turbostat self-performance visibility tools/power turbostat: Add fixed RAPL PSYS divisor for SPR tools/power turbostat: Fix PMT mmaped file size rounding tools/power turbostat: Remove SysWatt from DISABLED_BY_DEFAULT tools/power turbostat: Add an NMI column tools/power turbostat: add Busy% to "show idle" tools/power turbostat: Introduce --force parameter tools/power turbostat: Improve --help output ...	2025-02-02 10:49:13 -08:00
Linus Torvalds	5d82ca7b50	Merge tag 'sh-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux Pull sh updates from John Paul Adrian Glaubitz: "Fixes and improvements for sh: - replace seq_printf() with the more efficient seq_put_decimal_ull_width() to increase performance when stress reading /proc/interrupts (David Wang) - migrate sh to the generic rule for built-in DTB to help avoid race conditions during parallel builds which can occur because Kbuild decends into arch//boot/dts twice (Masahiro Yamada) - replace select with imply in the board Kconfig for enabling hardware with complex dependencies. This addresses warnings which were reported by the kernel test robot (Geert Uytterhoeven)" tag 'sh-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux: sh: boards: Use imply to enable hardware with complex dependencies sh: Migrate to the generic rule for built-in DTB sh: irq: Use seq_put_decimal_ull_width() for decimal values	2025-02-02 10:40:27 -08:00
Len Brown	2c4627c8ce	tools/power turbostat: version 2025.02.02 Summary of Changes since 2024.11.30: Fix regression in 2023.11.07 that affinitized forked child in one-shot mode. Harden one-shot mode against hotplug online/offline Enable RAPL SysWatt column by default. Add initial PTL, CWF platform support. Harden initial PMT code in response to early use. Enable first built-in PMT counter: CWF c1e residency Refuse to run on unsupported platforms without --force, to encourage updating to a version that supports the system, and to avoid no-so-useful measurement results. Signed-off-by: Len Brown <len.brown@intel.com>	2025-02-02 10:54:23 -06:00
Linus Torvalds	a86bf2283d	Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs cleanups from Al Viro: "Two unrelated patches - one is a removal of long-obsolete include in overlayfs (it used to need fs/internal.h, but the extern it wanted has been moved back to include/linux/namei.h) and another introduces convenience helper constructing struct qstr by a NUL-terminated string" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add a string-to-qstr constructor fs/overlayfs/namei.c: get rid of include ../internal.h	2025-02-01 15:07:56 -08:00
Linus Torvalds	c270ab5a87	Merge tag 'mips_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS fix from Thomas Bogendoerfer: "Revert commit breaking sysv ipc for o32 ABI" * tag 'mips_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: Revert "mips: fix shmctl/semctl/msgctl syscall for o32"	2025-02-01 14:54:33 -08:00
Linus Torvalds	cabb4685d5	Merge tag 'v6.14-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 Pull more smb client updates from Steve French: - various updates for special file handling: symlink handling, support for creating sockets, cleanups, new mount options (e.g. to allow disabling using reparse points for them, and to allow overriding the way symlinks are saved), and fixes to error paths - fix for kerberos mounts (allow IAKerb) - SMB1 fix for stat and for setting SACL (auditing) - fix an incorrect error code mapping - cleanups" * tag 'v6.14-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: (21 commits) cifs: Fix parsing native symlinks directory/file type cifs: update internal version number cifs: Add support for creating WSL-style symlinks smb3: add support for IAKerb cifs: Fix struct FILE_ALL_INFO cifs: Add support for creating NFS-style symlinks cifs: Add support for creating native Windows sockets cifs: Add mount option -o reparse=none cifs: Add mount option -o symlink= for choosing symlink create type cifs: Fix creating and resolving absolute NT-style symlinks cifs: Simplify reparse point check in cifs_query_path_info() function cifs: Remove symlink member from cifs_open_info_data union cifs: Update description about ACL permissions cifs: Rename struct reparse_posix_data to reparse_nfs_data_buffer and move to common/smb2pdu.h cifs: Remove struct reparse_posix_data from struct cifs_open_info_data cifs: Remove unicode parameter from parse_reparse_point() function cifs: Fix getting and setting SACLs over SMB1 cifs: Remove intermediate object of failed create SFU call cifs: Validate EAs for WSL reparse points cifs: Change translation of STATUS_PRIVILEGE_NOT_HELD to -EPERM ...	2025-02-01 11:30:41 -08:00
Linus Torvalds	8c198ffd63	Merge tag 'driver-core-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull debugfs fix from Greg KH: "Here is a single debugfs fix from Al to resolve a reported regression in the driver-core tree. It has been reported to fix the issue" * tag 'driver-core-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: debugfs: Fix the missing initializations in __debugfs_file_get()	2025-02-01 10:04:29 -08:00
Linus Torvalds	03cc3579bc	Merge tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes. 8 are cc:stable and the remainder address post-6.13 issues. 13 are for MM and 8 are for non-MM. All are singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) MAINTAINERS: include linux-mm for xarray maintenance revert "xarray: port tests to kunit" MAINTAINERS: add lib/test_xarray.c mailmap, MAINTAINERS, docs: update Carlos's email address mm/hugetlb: fix hugepage allocation for interleaved memory nodes mm: gup: fix infinite loop within __get_longterm_locked mm, swap: fix reclaim offset calculation error during allocation .mailmap: update email address for Christopher Obbard kfence: skip __GFP_THISNODE allocations on NUMA systems nilfs2: fix possible int overflows in nilfs_fiemap() mm: compaction: use the proper flag to determine watermarks kernel: be more careful about dup_mmap() failures and uprobe registering mm/fake-numa: handle cases with no SRAT info mm: kmemleak: fix upper boundary check for physical address objects mailmap: add an entry for Hamza Mahfooz MAINTAINERS: mailmap: update Yosry Ahmed's email address scripts/gdb: fix aarch64 userspace detection in get_current_task mm/vmscan: accumulate nr_demoted for accurate demotion statistics ocfs2: fix incorrect CPU endianness conversion causing mount failure mm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc() ...	2025-02-01 09:49:20 -08:00
Linus Torvalds	c6fe03a3f9	Merge tag 'media/v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fix from Mauro Carvalho Chehab: "A revert for a regression in the uvcvideo driver" * tag 'media/v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: Revert "media: uvcvideo: Require entities to have a non-zero unique ID"	2025-02-01 09:15:01 -08:00
Andrew Morton	e5b2a356dc	MAINTAINERS: include linux-mm for xarray maintenance MM developers have an interest in the xarray code. Cc: David Gow <davidgow@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Tamir Duberstein <tamird@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:28 -08:00
Andrew Morton	050339050f	revert "xarray: port tests to kunit" Revert `c7bb5cf9fc` ("xarray: port tests to kunit"). It broke the build when compiing the xarray userspace test harness code. Reported-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Closes: https://lkml.kernel.org/r/07cf896e-adf8-414f-a629-a808fc26014a@oracle.com Cc: David Gow <davidgow@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tamir Duberstein <tamird@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:28 -08:00
Tamir Duberstein	0ca2a41e0c	MAINTAINERS: add lib/test_xarray.c Ensure test-only changes are sent to the relevant maintainer. Link: https://lkml.kernel.org/r/20250129-xarray-test-maintainer-v1-1-482e31f30f47@gmail.com Signed-off-by: Tamir Duberstein <tamird@gmail.com> Cc: Mattew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:27 -08:00
Carlos Bilbao	e5eaa1bbe2	mailmap, MAINTAINERS, docs: update Carlos's email address Update .mailmap to reflect my new (and final) primary email address, carlos.bilbao@kernel.org. Also update contact information in files Documentation/translations/sp_SP/index.rst and MAINTAINERS. Link: https://lkml.kernel.org/r/20250130012248.1196208-1-carlos.bilbao@kernel.org Signed-off-by: Carlos Bilbao <carlos.bilbao@kernel.org> Cc: Carlos Bilbao <bilbao@vt.edu> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mattew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:27 -08:00
Ritesh Harjani (IBM)	76e961157e	mm/hugetlb: fix hugepage allocation for interleaved memory nodes gather_bootmem_prealloc() assumes the start nid as 0 and size as num_node_state(N_MEMORY). That means in case if memory attached numa nodes are interleaved, then gather_bootmem_prealloc_parallel() will fail to scan few of these nodes. Since memory attached numa nodes can be interleaved in any fashion, hence ensure that the current code checks for all numa node ids (.size = nr_node_ids). Let's still keep max_threads as N_MEMORY, so that it can distributes all nr_node_ids among the these many no. threads. e.g. qemu cmdline ======================== numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20" mem_cmd="-object memory-backend-ram,id=mem1,size=16G" w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2): ========================== ~ # cat /proc/meminfo \|grep -i huge AnonHugePages: 0 kB ShmemHugePages: 0 kB FileHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 1048576 kB Hugetlb: 0 kB with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2): =========================== ~ # cat /proc/meminfo \|grep -i huge AnonHugePages: 0 kB ShmemHugePages: 0 kB FileHugePages: 0 kB HugePages_Total: 2 HugePages_Free: 2 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 1048576 kB Hugetlb: 2097152 kB Link: https://lkml.kernel.org/r/f8d8dad3a5471d284f54185f65d575a6aaab692b.1736592534.git.ritesh.list@gmail.com Fixes: `b78b27d029` ("hugetlb: parallelize 1G hugetlb initialization") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reported-by: Pavithra Prakash <pavrampu@linux.ibm.com> Suggested-by: Muchun Song <muchun.song@linux.dev> Tested-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Gang Li <gang.li@linux.dev> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:27 -08:00
Zhaoyang Huang	1aaf8c1229	mm: gup: fix infinite loop within __get_longterm_locked We can run into an infinite loop in __get_longterm_locked() when collect_longterm_unpinnable_folios() finds only folios that are isolated from the LRU or were never added to the LRU. This can happen when all folios to be pinned are never added to the LRU, for example when vm_ops->fault allocated pages using cma_alloc() and never added them to the LRU. Fix it by simply taking a look at the list in the single caller, to see if anything was added. [zhaoyang.huang@unisoc.com: move definition of local] Link: https://lkml.kernel.org/r/20250122012604.3654667-1-zhaoyang.huang@unisoc.com Link: https://lkml.kernel.org/r/20250121020159.3636477-1-zhaoyang.huang@unisoc.com Fixes: `67e139b02d` ("mm/gup.c: refactor check_and_migrate_movable_pages()") Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aijun Sun <aijun.sun@unisoc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:27 -08:00
Kairui Song	498c48c66e	mm, swap: fix reclaim offset calculation error during allocation There is a code error that will cause the swap entry allocator to reclaim and check the whole cluster with an unexpected tail offset instead of the part that needs to be reclaimed. This may cause corruption of the swap map, so fix it. Link: https://lkml.kernel.org/r/20250130115131.37777-1-ryncsn@gmail.com Fixes: `3b644773ee` ("mm, swap: reduce contention on device lock") Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chris Li <chrisl@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:26 -08:00
Christopher Obbard	1ccae30ecd	.mailmap: update email address for Christopher Obbard Update my email address. Link: https://lkml.kernel.org/r/20250122-wip-obbardc-update-email-v2-1-12bde6b79ad0@linaro.org Signed-off-by: Christopher Obbard <christopher.obbard@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:26 -08:00
Marco Elver	e64f81946a	kfence: skip __GFP_THISNODE allocations on NUMA systems On NUMA systems, __GFP_THISNODE indicates that an allocation _must_ be on a particular node, and failure to allocate on the desired node will result in a failed allocation. Skip __GFP_THISNODE allocations if we are running on a NUMA system, since KFENCE can't guarantee which node its pool pages are allocated on. Link: https://lkml.kernel.org/r/20250124120145.410066-1-elver@google.com Fixes: `236e9f1538` ("kfence: skip all GFP_ZONEMASK allocations") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Alexander Potapenko <glider@google.com> Cc: Chistoph Lameter <cl@linux.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:26 -08:00
Nikita Zhandarovich	6438ef381c	nilfs2: fix possible int overflows in nilfs_fiemap() Since nilfs_bmap_lookup_contig() in nilfs_fiemap() calculates its result by being prepared to go through potentially maxblocks == INT_MAX blocks, the value in n may experience an overflow caused by left shift of blkbits. While it is extremely unlikely to occur, play it safe and cast right hand expression to wider type to mitigate the issue. Found by Linux Verification Center (linuxtesting.org) with static analysis tool SVACE. Link: https://lkml.kernel.org/r/20250124222133.5323-1-konishi.ryusuke@gmail.com Fixes: `622daaff0a` ("nilfs2: fiemap support") Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:26 -08:00
yangge	6268f0a166	mm: compaction: use the proper flag to determine watermarks There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of memory. I have configured 16GB of CMA memory on each NUMA node, and starting a 32GB virtual machine with device passthrough is extremely slow, taking almost an hour. Long term GUP cannot allocate memory from CMA area, so a maximum of 16 GB of no-CMA memory on a NUMA node can be used as virtual machine memory. There is 16GB of free CMA memory on a NUMA node, which is sufficient to pass the order-0 watermark check, causing the __compaction_suitable() function to consistently return true. For costly allocations, if the __compaction_suitable() function always returns true, it causes the __alloc_pages_slowpath() function to fail to exit at the appropriate point. This prevents timely fallback to allocating memory on other nodes, ultimately resulting in excessively long virtual machine startup times. Call trace: __alloc_pages_slowpath if (compact_result == COMPACT_SKIPPED \|\| compact_result == COMPACT_DEFERRED) goto nopage; // should exit __alloc_pages_slowpath() from here We could use the real unmovable allocation context to have __zone_watermark_unusable_free() subtract CMA pages, and thus we won't pass the order-0 check anymore once the non-CMA part is exhausted. There is some risk that in some different scenario the compaction could in fact migrate pages from the exhausted non-CMA part of the zone to the CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY allocations should be affected in the immediate "goto nopage" when compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't fail without trying to compact-migrate the non-CMA pageblocks into CMA pageblocks first, so it should be fine. After this fix, it only takes a few tens of seconds to start a 32GB virtual machine with device passthrough functionality. Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/ Link: https://lkml.kernel.org/r/1737788037-8439-1-git-send-email-yangge1116@126.com Signed-off-by: yangge <yangge1116@126.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:25 -08:00
Liam R. Howlett	64c37e134b	kernel: be more careful about dup_mmap() failures and uprobe registering If a memory allocation fails during dup_mmap(), the maple tree can be left in an unsafe state for other iterators besides the exit path. All the locks are dropped before the exit_mmap() call (in mm/mmap.c), but the incomplete mm_struct can be reached through (at least) the rmap finding the vmas which have a pointer back to the mm_struct. Up to this point, there have been no issues with being able to find an mm_struct that was only partially initialised. Syzbot was able to make the incomplete mm_struct fail with recent forking changes, so it has been proven unsafe to use the mm_struct that hasn't been initialised, as referenced in the link below. Although `8ac662f5da` ("fork: avoid inappropriate uprobe access to invalid mm") fixed the uprobe access, it does not completely remove the race. This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the oom side (even though this is extremely unlikely to be selected as an oom victim in the race window), and sets MMF_UNSTABLE to avoid other potential users from using a partially initialised mm_struct. When registering vmas for uprobe, skip the vmas in an mm that is marked unstable. Modifying a vma in an unstable mm may cause issues if the mm isn't fully initialised. Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/ Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com Fixes: `d240629148` ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:25 -08:00
Bruno Faccini	4c80187001	mm/fake-numa: handle cases with no SRAT info Handle more gracefully cases where no SRAT information is available, like in VMs with no Numa support, and allow fake-numa configuration to complete successfully in these cases Link: https://lkml.kernel.org/r/20250127171623.1523171-1-bfaccini@nvidia.com Fixes: `63db8170bf` (“mm/fake-numa: allow later numa node hotplug”) Signed-off-by: Bruno Faccini <bfaccini@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <hyeonggon.yoo@sk.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Len Brown <lenb@kernel.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:25 -08:00
Catalin Marinas	488b5b9eca	mm: kmemleak: fix upper boundary check for physical address objects Memblock allocations are registered by kmemleak separately, based on their physical address. During the scanning stage, it checks whether an object is within the min_low_pfn and max_low_pfn boundaries and ignores it otherwise. With the recent addition of __percpu pointer leak detection (commit `6c99d4eb7c` ("kmemleak: enable tracking for percpu pointers")), kmemleak started reporting leaks in setup_zone_pageset() and setup_per_cpu_pageset(). These were caused by the node_data[0] object (initialised in alloc_node_data()) ending on the PFN_PHYS(max_low_pfn) boundary. The non-strict upper boundary check introduced by commit `84c3262991` ("mm: kmemleak: check physical address when scan") causes the pg_data_t object to be ignored (not scanned) and the __percpu pointers it contains to be reported as leaks. Make the max_low_pfn upper boundary check strict when deciding whether to ignore a physical address object and not scan it. Link: https://lkml.kernel.org/r/20250127184233.2974311-1-catalin.marinas@arm.com Fixes: `84c3262991` ("mm: kmemleak: check physical address when scan") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Cc: <stable@vger.kernel.org> [6.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:25 -08:00
Hamza Mahfooz	c3d8ced37e	mailmap: add an entry for Hamza Mahfooz Map my previous work email to my current one. Link: https://lkml.kernel.org/r/20250120205659.139027-1-hamzamahfooz@linux.microsoft.com Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: David S. Miller <davem@davemloft.net> Cc: Hans verkuil <hverkuil@xs4all.nl> Cc: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:24 -08:00
Yosry Ahmed	bc40470134	MAINTAINERS: mailmap: update Yosry Ahmed's email address Moving to a linux.dev email address. Link: https://lkml.kernel.org/r/20250123231344.817358-1-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:24 -08:00
Jan Kiszka	4ebc417ef9	scripts/gdb: fix aarch64 userspace detection in get_current_task At least recent gdb releases (seen with 14.2) return SP_EL0 as signed long which lets the right-shift always return 0. Link: https://lkml.kernel.org/r/dcd2fabc-9131-4b48-8419-6444e2d67454@siemens.com Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Barry Song <baohua@kernel.org> Cc: Kieran Bingham <kbingham@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:24 -08:00
Li Zhijian	a479b078fd	mm/vmscan: accumulate nr_demoted for accurate demotion statistics In shrink_folio_list(), demote_folio_list() can be called 2 times. Currently stat->nr_demoted will only store the last nr_demoted( the later nr_demoted is always zero, the former nr_demoted will get lost), as a result number of demoted pages is not accurate. Accumulate the nr_demoted count across multiple calls to demote_folio_list(), ensuring accurate reporting of demotion statistics. [lizhijian@fujitsu.com: introduce local nr_demoted to fix nr_reclaimed double counting] Link: https://lkml.kernel.org/r/20250111015253.425693-1-lizhijian@fujitsu.com Link: https://lkml.kernel.org/r/20250110122133.423481-1-lizhijian@fujitsu.com Fixes: `f77f0c7514` ("mm,memcg: provide per-cgroup counters for NUMA balancing operations") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Acked-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Tested-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-02-01 03:53:24 -08:00

1 2 3 4 5 ...

1335480 Commits