2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

5354 Commits

Author SHA1 Message Date
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
d87d73895f New ext4 features and performance improvements:
* Fast commit performance improvements
    * Multi-fsblock atomic write support for bigalloc file systems
    * Large folio support for regular files
 
 This last can result in really stupendous performance for the right
 workloads.  For example, see [1] where the Kernel Test Robot reported
 over 37% improvement on a large sequential I/O workload.
 
 [1] https://lore.kernel.org/all/202505161418.ec0d753f-lkp@intel.com/
 
 There are also the usual bug fixes and cleanups.  Of note are cleanups
 of the extent status tree to fix potential races that could result in
 the extent status tree getting corrupted under heavy siulatneous
 allocation and deallocation to a single file.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmg2GJgACgkQ8vlZVpUN
 gaOmmgf/fEh2OPDG6aAJpQ6hjy2WIbrxqTyuWC/+AFnyI5/Jy0Iskis3lHBiFdKP
 IFjgC1h9CB5ARVvOLd7NOgflPHSSHsnYqoTCS6J4tdWvFN4VRiHe2J3fdTZd/bea
 dzdWniHS3SAJiQm4wvbkhluFgecItBHYzDltapkHI0OGepxZt3thWVvbay6veO9R
 ChXQ7T7/9eUZa5N5IVUeJmWobgh0RD+DgtwCih59UDfnezGqiDr6/shpyNC6EvWV
 oZdvJw2+2DCPn5+DF4Ut77mLpKnxorQ4osNPOovZf59JnSyEcCmbBDuvyNfRldfC
 yQYoCFkOv0Fz8tgJbtoAN71+YXl66w==
 =fxDh
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "New ext4 features and performance improvements:

   - Fast commit performance improvements

   - Multi-fsblock atomic write support for bigalloc file systems

   - Large folio support for regular files

  This last can result in really stupendous performance for the right
  workloads. For example, see [1] where the Kernel Test Robot reported
  over 37% improvement on a large sequential I/O workload.

  There are also the usual bug fixes and cleanups. Of note are cleanups
  of the extent status tree to fix potential races that could result in
  the extent status tree getting corrupted under heavy simultaneous
  allocation and deallocation to a single file"

Link: https://lore.kernel.org/all/202505161418.ec0d753f-lkp@intel.com/ [1]

* tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
  ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF instead
  ext4: Simplify flags in ext4_map_query_blocks()
  ext4: Rename and document EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER
  ext4: Simplify last in leaf check in ext4_map_query_blocks
  ext4: Unwritten to written conversion requires EXT4_EX_NOCACHE
  ext4: only dirty folios when data journaling regular files
  ext4: Add atomic block write documentation
  ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
  ext4: Add multi-fsblock atomic write support with bigalloc
  ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
  ext4: Make ext4_meta_trans_blocks() non-static for later use
  ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
  ext4: Document an edge case for overwrites
  jbd2: remove journal_t argument from jbd2_superblock_csum()
  jbd2: remove journal_t argument from jbd2_chksum()
  ext4: remove sb argument from ext4_superblock_csum()
  ext4: remove sbi argument from ext4_chksum()
  ext4: enable large folio for regular file
  ext4: make online defragmentation support large folios
  ext4: make the writeback path support large folios
  ...
2025-05-28 12:12:08 -07:00
Linus Torvalds
f83fcb87f8 xfs: New code for 6.16
Signed-off-by: Carlos Maiolino <cem@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQSmtYVZ/MfVMGUq1GNcsMJ8RxYuYwUCaDQXTQAKCRBcsMJ8RxYu
 YwUHAYDYYm9oit6AIr0AgTXBMJ+DHyqaszBy0VT2jQUP+yXxyrQc46QExXKU9YQV
 ffmGRAsBgN7ZdDI8D5qWySyOynB3b1Jn3/0jY82GscFK0k0oX3EtxbN9MdrovbgK
 qyO66BVx7w==
 =pG5y
 -----END PGP SIGNATURE-----

Merge tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Carlos Maiolino:

 - Atomic writes for XFS

 - Remove experimental warnings for pNFS, scrub and parent pointers

* tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
  xfs: add inode to zone caching for data placement
  xfs: free the item in xfs_mru_cache_insert on failure
  xfs: remove the EXPERIMENTAL warning for pNFS
  xfs: remove some EXPERIMENTAL warnings
  xfs: Remove deprecated xfs_bufd sysctl parameters
  xfs: stop using set_blocksize
  xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  xfs: update atomic write limits
  xfs: add xfs_calc_atomic_write_unit_max()
  xfs: add xfs_file_dio_write_atomic()
  xfs: commit CoW-based atomic writes atomically
  xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
  xfs: add xfs_atomic_write_cow_iomap_begin()
  xfs: refine atomic write size check in xfs_file_write_iter()
  xfs: refactor xfs_reflink_end_cow_extent()
  xfs: allow block allocator to take an alignment hint
  xfs: ignore HW which cannot atomic write a single block
  xfs: add helpers to compute transaction reservation for finishing intent items
  xfs: add helpers to compute log item overhead
  xfs: separate out setting buftarg atomic writes limits
  ...
2025-05-26 12:56:01 -07:00
Ritesh Harjani (IBM)
7acd1b315c ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF instead
We added the documentation in ext4_map_blocks() for usage of
EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF flag. But It's better to add
a WARN_ON_ONCE in case if anyone tries using this flag with CREATE to
avoid a random issue later. Since depth can change with CREATE and it
needs to be re-calculated before using it in there.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/ee6e82a224c50b432df9ce1ce3333c50182d8473.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 14:21:00 -04:00
Ritesh Harjani (IBM)
618320daa9 ext4: Simplify flags in ext4_map_query_blocks()
Now that we have EXT4_EX_QUERY_FILTER mask, let's use that to simplify
the filtering of flags for passing to ext4_ext_map_blocks() in
ext4_map_query_blocks() function. This allows us to kill the query_flags
local variable which is not needed anymore.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/4ae735e83e6f43341e53e2d289e59156a8360134.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 14:21:00 -04:00
Ritesh Harjani (IBM)
9597376bdb ext4: Rename and document EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER
Rename EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER to better describe its
purpose as a filter mask used specifically in ext4_map_query_blocks().
Add a comment explaining that this macro is used to filter flags needed
when querying the on-disk extent tree.

We will later use EXT4_EX_QUERY_FILTER mask to add another
EXT4_GET_BLOCKS_QUERY needed to lookup in on-disk extent tree.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/51f05d0ba286372eb8693af95bd4b10194b53141.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 14:21:00 -04:00
Ritesh Harjani (IBM)
797ac3ca24 ext4: Simplify last in leaf check in ext4_map_query_blocks
This simplifies the check for last in leaf in ext4_map_query_blocks()
and fixes this cocci warning.

cocci warnings: (new ones prefixed by >>)
>> fs/ext4/inode.c:573:49-51: WARNING !A || A && B is equivalent to !A || B

Fixes: 5bb12b1837 ("ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505191524.auftmOwK-lkp@intel.com/
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/5fd5c806218c83f603c578c95997cf7f6da29d74.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 14:21:00 -04:00
Ritesh Harjani (IBM)
3be4bb7e71 ext4: Unwritten to written conversion requires EXT4_EX_NOCACHE
This fixes the atomic write patch series after it was rebased on top of
extent status cache cleanup series i.e.

'commit 402e38e6b7 ("ext4: prevent stale extent cache entries caused by
concurrent I/O writeback")'

After the above series, EXT4_GET_BLOCKS_IO_CONVERT_EXT flag which has
EXT4_GET_BLOCKS_IO_SUBMIT flag set, requires that the io submit context
of any kind should pass EXT4_EX_NOCACHE to avoid caching unncecessary
extents in the extent status cache.

This patch fixes that by adding the EXT4_EX_NOCACHE flag in
ext4_convert_unwritten_extents_atomic() for unwritten to written
conversion calls to ext4_map_blocks().

Fixes: b86629c2b2 ("ext4: Add multi-fsblock atomic write support with bigalloc")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/ea0ad9378ff6d31d73f4e53f87548e3a20817689.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 14:21:00 -04:00
Brian Foster
e26268ff1d ext4: only dirty folios when data journaling regular files
fstest generic/388 occasionally reproduces a crash that looks as
follows:

BUG: kernel NULL pointer dereference, address: 0000000000000000
...
Call Trace:
 <TASK>
 ext4_block_zero_page_range+0x30c/0x380 [ext4]
 ext4_truncate+0x436/0x440 [ext4]
 ext4_process_orphan+0x5d/0x110 [ext4]
 ext4_orphan_cleanup+0x124/0x4f0 [ext4]
 ext4_fill_super+0x262d/0x3110 [ext4]
 get_tree_bdev_flags+0x132/0x1d0
 vfs_get_tree+0x26/0xd0
 vfs_cmd_create+0x59/0xe0
 __do_sys_fsconfig+0x4ed/0x6b0
 do_syscall_64+0x82/0x170
 ...

This occurs when processing a symlink inode from the orphan list. The
partial block zeroing code in the truncate path calls
ext4_dirty_journalled_data() -> folio_mark_dirty(). The latter calls
mapping->a_ops->dirty_folio(), but symlink inodes are not assigned an
a_ops vector in ext4, hence the crash.

To avoid this problem, update the ext4_dirty_journalled_data() helper to
only mark the folio dirty on regular files (for which a_ops is
assigned). This also matches the journaling logic in the ext4_symlink()
creation path, where ext4_handle_dirty_metadata() is called directly.

Fixes: d84c9ebdac ("ext4: Mark pages with journalled data dirty")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://patch.msgid.link/20250516173800.175577-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
2025-05-20 10:31:13 -04:00
Ritesh Harjani (IBM)
642e0dc73c ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
Last couple of patches added the needed support for multi-fsblock atomic
writes using bigalloc. This patch ensures that filesystem advertizes the
needed atomic write unit min and max values for enabling multi-fsblock
atomic write support with bigalloc.

Acked-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/5e45d7ed24499024b9079436ba6698dae5298e29.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Ritesh Harjani (IBM)
b86629c2b2 ext4: Add multi-fsblock atomic write support with bigalloc
EXT4 supports bigalloc feature which allows the FS to work in size of
clusters (group of blocks) rather than individual blocks. This patch
adds atomic write support for bigalloc so that systems with bs = ps can
also create FS using -
    mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>

With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
adjust ext4's atomic write unit max value to cluster size. This can then support
atomic write of size anywhere between [blocksize, clustersize]. This
patch adds the required changes to enable multi-fsblock atomic write
support using bigalloc in the next patch.

In this patch for block allocation:
we first query the underlying region of the requested range by calling
ext4_map_blocks() call. Here are the various cases which we then handle
depending upon the underlying mapping type:
1. If the underlying region for the entire requested range is a mapped extent,
   then we don't call ext4_map_blocks() to allocate anything. We don't need to
   even start the jbd2 txn in this case.
2. For an append write case, we create a mapped extent.
3. If the underlying region is entirely a hole, then we create an unwritten
   extent for the requested range.
4. If the underlying region is a large unwritten extent, then we split the
   extent into 2 unwritten extent of required size.
5. If the underlying region has any type of mixed mapping, then we call
   ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
   within the requested range. This then provide a single mapped extent type
   mapping for the requested range.

Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
flag only when the underlying extent mapping of the requested range is
not entirely a hole, an unwritten extent, or a fully mapped extent. That
is, if the underlying region contains a mix of hole(s), unwritten
extent(s), and mapped extent(s), we use this loop to ensure that all the
short mappings are zeroed out. This guarantees that the entire requested
range becomes a single, uniformly mapped extent. It is ok to do so
because we know this is being done on a bigalloc enabled filesystem
where the block bitmap represents the entire cluster unit.

Note having a single contiguous underlying region of type mapped,
unwrittn or hole is not a problem. But the reason to avoid writing on
top of mixed mapping region is because, atomic writes requires all or
nothing should get written for the userspace pwritev2 request. So if at
any point in time during the write if a crash or a sudden poweroff
occurs, the region undergoing atomic write should read either complete
old data or complete new data. But it should never have a mix of both
old and new data.
So, we first convert any mixed mapping region to a single contiguous
mapped extent before any data gets written to it. This is because
normally FS will only convert unwritten extents to written at the end of
the write in ->end_io() call. And if we allow the writes over a mixed
mapping and if a sudden power off happens in between, we will end up
reading mix of new data (over mapped extents) and old data (over
unwritten extents), because unwritten to written conversion never went
through.
So to avoid this and to avoid writes getting torned due to mixed
mapping, we first allocate a single contiguous block mapping and then
do the write.

Acked-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/c4965ac3407cbc773f0bc954d0966d9696f5038a.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Ritesh Harjani (IBM)
5bb12b1837 ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
There can be a case where there are contiguous extents on the adjacent
leaf nodes of on-disk extent trees. So when someone tries to write to
this contiguous range, ext4_map_blocks() call will split by returning
1 extent at a time if this is not already cached in extent_status tree
cache (where if these extents when cached can get merged since they are
contiguous).

This is fine for a normal write however in case of atomic writes, it
can't afford to break the write into two. Now this is also something
that will only happen in the slow write case where we call
ext4_map_blocks() for each of these extents spread across different leaf
nodes. However, there is no guarantee that these extent status cache
cannot be reclaimed before the last call to ext4_map_blocks() in
ext4_map_blocks_atomic_write_slow().

Hence this patch adds support of EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS.
This flag checks if the requested range can be fully found in extent
status cache and return. If not, it looks up in on-disk extent
tree via ext4_map_query_blocks(). If the found extent is the last entry
in the leaf node, then it goes and queries the next lblk to see if there
is an adjacent contiguous extent in the adjacent leaf node of the
on-disk extent tree.

Even though there can be a case where there are multiple adjacent extent
entries spread across multiple leaf nodes. But we only read an adjacent
leaf block i.e. in total of 2 extent entries spread across 2 leaf nodes.
The reason for this is that we are mostly only going to support atomic
writes with upto 64KB or maybe max upto 1MB of atomic write support.

Acked-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/6bb563e661f5fbd80e266a9e6ce6e29178f555f6.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Ritesh Harjani (IBM)
255e7bc212 ext4: Make ext4_meta_trans_blocks() non-static for later use
Let's make ext4_meta_trans_blocks() non-static for use in later
functions during ->end_io conversion for atomic writes.
We will need this function to estimate journal credits for a special
case. Instead of adding another wrapper around it, let's make this
non-static.

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/23ce80d4286f792831ce99d13558182ee228fedb.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Ritesh Harjani (IBM)
1c972b1d13 ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
EXT4 only supports doing atomic write on inodes which uses extents, so
add a check in ext4_inode_can_atomic_write() which gets called during
open.

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/86bb502c979398a736ab371d8f35f6866a477f6c.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Ritesh Harjani (IBM)
9fa6121684 ext4: Document an edge case for overwrites
ext4_iomap_overwrite_begin() clears the flag for IOMAP_WRITE before
calling ext4_iomap_begin(). Document this above ext4_map_blocks() call
as it is easy to miss it when focusing on write paths alone.

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/fd50ba05440042dff77d555e463a620a79f8d0e9.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Eric Biggers
6017dbb7b6 ext4: remove sb argument from ext4_superblock_csum()
Since ext4_superblock_csum() no longer uses its sb argument, remove it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250513053809.699974-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Eric Biggers
6cbab5f95e ext4: remove sbi argument from ext4_chksum()
Since ext4_chksum() no longer uses its sbi argument, remove it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250513053809.699974-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
7ac67301e8 ext4: enable large folio for regular file
Besides fsverity, fscrypt, and the data=journal mode, ext4 now supports
large folios for regular files. Enable this feature by default. However,
since we cannot change the folio order limitation of mappings on active
inodes, setting the journal=data mode via ioctl on an active inode will
not take immediate effect in non-delalloc mode.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
01e807e18f ext4: make online defragmentation support large folios
move_extent_per_page() currently assumes that each folio is the size of
PAGE_SIZE and only copies data for one page. ext4_move_extents() should
call move_extent_per_page() for each page. To support larger folios,
simply modify the calculations for the block start and end offsets
within the folio based on the provided range of 'data_offset_in_page'
and 'block_len_in_page'. This function will continue to handle PAGE_SIZE
of data at a time and will not convert this function to manage an entire
folio. Additionally, we use the source folio to copy data, so it doesn't
matter if the source and dest folios are different in size.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
cd9f76de6a ext4: make the writeback path support large folios
In mpage_map_and_submit_buffers(), the 'lblk' is now aligned to
PAGE_SIZE. Convert it to be aligned to folio size. Additionally, modify
the wbc->nr_to_write update to reduce the number of pages in a single
folio, ensuring that the entire writeback path can support large folios.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
0e32d86170 ext4: correct the journal credits calculations of allocating blocks
The journal credits calculation in ext4_ext_index_trans_blocks() is
currently inadequate. It only multiplies the depth of the extents tree
and doesn't account for the blocks that may be required for adding the
leaf extents themselves.

After enabling large folios, we can easily run out of handle credits,
triggering a warning in jbd2_journal_dirty_metadata() on filesystems
with a 1KB block size. This occurs because we may need more extents when
iterating through each large folio in
ext4_do_writepages()->mpage_map_and_submit_extent(). Therefore, we
should modify ext4_ext_index_trans_blocks() to include a count of the
leaf extents in the worst case as well.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
d6bf294773 ext4/jbd2: convert jbd2_journal_blocks_per_page() to support large folio
jbd2_journal_blocks_per_page() returns the number of blocks in a single
page. Rename it to jbd2_journal_blocks_per_folio() and make it returns
the number of blocks in the largest folio, preparing for the calculation
of journal credits blocks when allocating blocks within a large folio in
the writeback path.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
2e9466fc5d ext4: make __ext4_block_zero_page_range() support large folio
The partial block zero range helper __ext4_block_zero_page_range()
currently only supports folios of PAGE_SIZE in size. The calculations
for the start block and the offset within a folio for the given range
are incorrect. Modify the implementation to use offset_in_folio()
instead of directly masking PAGE_SIZE - 1, which will be able to support
for large folios.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
16705e52e6 ext4: make regular file's buffered write path support large folios
The current buffered write path in ext4 can only allocate and handle
folios of PAGE_SIZE size. To support larger folios, modify
ext4_da_write_begin() and ext4_write_begin() to allocate higher-order
folios, and trim the write length if it exceeds the folio size.
Additionally, in ext4_da_do_write_end(), use offset_in_folio() instead
of PAGE_SIZE.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
fdbd0df9d4 ext4: make ext4_mpage_readpages() support large folios
ext4_mpage_readpages() currently assumes that each folio is the size of
PAGE_SIZE. Modify it to atomically calculate the number of blocks per
folio and iterate through the blocks in each folio, which would allow
for support of larger folios.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:11 -04:00
Zhang Yi
1a77a028a3 ext4: ensure i_size is smaller than maxbytes
The inode i_size cannot be larger than maxbytes, check it while loading
inode from the disk.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2025-05-20 10:31:06 -04:00
Zhang Yi
dbe27f06fa ext4: factor out ext4_get_maxbytes()
There are several locations that get the correct maxbytes value based on
the inode's block type. It would be beneficial to extract a common
helper function to make the code more clear.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2025-05-20 10:30:59 -04:00
Zhang Yi
29ec9bed23 ext4: fix incorrect punch max_end
For the extents based inodes, the maxbytes should be sb->s_maxbytes
instead of sbi->s_bitmap_maxbytes. Additionally, for the calculation of
max_end, the -sb->s_blocksize operation is necessary only for
indirect-block based inodes. Correct the maxbytes and max_end value to
correct the behavior of punch hole.

Fixes: 2da376228a ("ext4: limit length to bitmap_maxbytes - blocksize in punch_hole")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2025-05-20 10:30:52 -04:00
Zhang Yi
b5e58bcd79 ext4: fix out of bounds punch offset
Punching a hole with a start offset that exceeds max_end is not
permitted and will result in a negative length in the
truncate_inode_partial_folio() function while truncating the page cache,
potentially leading to undesirable consequences.

A simple reproducer:

  truncate -s 9895604649994 /mnt/foo
  xfs_io -c "pwrite 8796093022208 4096" /mnt/foo
  xfs_io -c "fpunch 8796093022213 25769803777" /mnt/foo

  kernel BUG at include/linux/highmem.h:275!
  Oops: invalid opcode: 0000 [#1] SMP PTI
  CPU: 3 UID: 0 PID: 710 Comm: xfs_io Not tainted 6.15.0-rc3
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:zero_user_segments.constprop.0+0xd7/0x110
  RSP: 0018:ffffc90001cf3b38 EFLAGS: 00010287
  RAX: 0000000000000005 RBX: ffffea0001485e40 RCX: 0000000000001000
  RDX: 000000000040b000 RSI: 0000000000000005 RDI: 000000000040b000
  RBP: 000000000040affb R08: ffff888000000000 R09: ffffea0000000000
  R10: 0000000000000003 R11: 00000000fffc7fc5 R12: 0000000000000005
  R13: 000000000040affb R14: ffffea0001485e40 R15: ffff888031cd3000
  FS:  00007f4f63d0b780(0000) GS:ffff8880d337d000(0000)
  knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000001ae0b038 CR3: 00000000536aa000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   truncate_inode_partial_folio+0x3dd/0x620
   truncate_inode_pages_range+0x226/0x720
   ? bdev_getblk+0x52/0x3e0
   ? ext4_get_group_desc+0x78/0x150
   ? crc32c_arch+0xfd/0x180
   ? __ext4_get_inode_loc+0x18c/0x840
   ? ext4_inode_csum+0x117/0x160
   ? jbd2_journal_dirty_metadata+0x61/0x390
   ? __ext4_handle_dirty_metadata+0xa0/0x2b0
   ? kmem_cache_free+0x90/0x5a0
   ? jbd2_journal_stop+0x1d5/0x550
   ? __ext4_journal_stop+0x49/0x100
   truncate_pagecache_range+0x50/0x80
   ext4_truncate_page_cache_block_range+0x57/0x3a0
   ext4_punch_hole+0x1fe/0x670
   ext4_fallocate+0x792/0x17d0
   ? __count_memcg_events+0x175/0x2a0
   vfs_fallocate+0x121/0x560
   ksys_fallocate+0x51/0xc0
   __x64_sys_fallocate+0x24/0x40
   x64_sys_call+0x18d2/0x4170
   do_syscall_64+0xa7/0x220
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by filtering out cases where the punching start offset exceeds
max_end.

Fixes: 982bf37da0 ("ext4: refactor ext4_punch_hole()")
Reported-by: Liebes Wang <wanghaichi0403@gmail.com>
Closes: https://lore.kernel.org/linux-ext4/ac3a58f6-e686-488b-a9ee-fc041024e43d@huawei.com/
Tested-by: Liebes Wang <wanghaichi0403@gmail.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2025-05-20 10:30:44 -04:00
Christoph Hellwig
e80325ef5c ext4: use writeback_iter in ext4_journalled_submit_inode_data_buffers
Use writeback_iter directly instead of write_cache_pages for a nicer
code structure and less indirect calls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250505091604.3449879-1-hch@lst.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:30:37 -04:00
Jan Kara
32a93f5bc9 ext4: fix calculation of credits for extent tree modification
Luis and David are reporting that after running generic/750 test for 90+
hours on 2k ext4 filesystem, they are able to trigger a warning in
jbd2_journal_dirty_metadata() complaining that there are not enough
credits in the running transaction started in ext4_do_writepages().

Indeed the code in ext4_do_writepages() is racy and the extent tree can
change between the time we compute credits necessary for extent tree
computation and the time we actually modify the extent tree. Thus it may
happen that the number of credits actually needed is higher. Modify
ext4_ext_index_trans_blocks() to count with the worst case of maximum
tree depth. This can reduce the possible number of writers that can
operate in the system in parallel (because the credit estimates now won't
fit in one transaction) but for reasonably sized journals this shouldn't
really be an issue. So just go with a safe and simple fix.

Link: https://lore.kernel.org/all/20250415013641.f2ppw6wov4kn4wq2@offworld
Reported-by: Davidlohr Bueso <dave@stgolabs.net>
Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: kdevops@lists.linux.dev
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250429175535.23125-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2025-05-20 10:30:17 -04:00
Arnd Bergmann
d612a07931 ext4: avoid -Wformat-security warning
check_igot_inode() prints a variable string, which causes a harmless
warning with 'make W=1':

fs/ext4/inode.c:4763:45: error: format string is not a string literal (potentially insecure) [-Werror,-Wformat-security]
 4763 |         ext4_error_inode(inode, function, line, 0, err_str);

Use a trivial "%s" format string instead.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250423164354.2780635-1-arnd@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-16 14:44:59 -04:00
Zhang Yi
24b7a2331f ext4: clairfy the rules for modifying extents
Add a comment at the beginning of extents_status.c to clarify the rules
for loading, mapping, modifying, and removing extents and blocks.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-10-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-15 13:14:56 -04:00
Zhang Yi
1b4d2a0b79 ext4: check env when mapping and modifying extents
Add ext4_check_map_extents_env() to the places where loading extents,
mapping blocks, removing blocks, and modifying extents, excluding the
I/O writeback context. This function will verify whether the locking
mechanisms in place are adequate.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-15 13:14:51 -04:00
Zhang Yi
7871da20d4 ext4: introduce ext4_check_map_extents_env() debug helper
Loading and modifying the extents tree and extent status tree without
holding the inode's i_rwsem or the mapping's invalidate_lock is not
permitted, except during the I/O writeback. Add a new debug helper
ext4_check_map_extents_env(), it will verify whether the extent
loading/modifying context is safe.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14 10:42:13 -04:00
Zhang Yi
0b8e0bd450 ext4: factor out is_special_ino()
Factor out the helper is_special_ino() to facilitate the checking of
special inodes in the subsequent patches.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14 10:42:12 -04:00
Zhang Yi
f22a0ef223 ext4: prevent stale extent cache entries caused by concurrent get es_cache
The EXT4_IOC_GET_ES_CACHE and EXT4_IOC_PRECACHE_EXTENTS currently
invokes ext4_ext_precache() to preload the extent cache without holding
the inode's i_rwsem. This can result in stale extent cache entries when
competing with operations such as ext4_collapse_range() which calls
ext4_ext_remove_space() or ext4_ext_shift_extents().

The problem arises when ext4_ext_remove_space() temporarily releases
i_data_sem due to insufficient journal credits. During this interval, a
concurrent EXT4_IOC_GET_ES_CACHE or EXT4_IOC_PRECACHE_EXTENTS may cache
extent entries that are about to be deleted. As a result, these cached
entries become stale and inconsistent with the actual extents.

Loading the extents cache without holding the inode's i_rwsem or the
mapping's invalidate_lock is not permitted besides during the writeback.
Fix this by holding the i_rwsem during EXT4_IOC_GET_ES_CACHE and
EXT4_IOC_PRECACHE_EXTENTS.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14 10:42:12 -04:00
Zhang Yi
151ff9325e ext4: prevent stale extent cache entries caused by concurrent fiemap
The ext4_fiemap() currently invokes ext4_ext_precache() and
iomap_fiemap() to preload the extent cache and query mapping information
without holding the inode's i_rwsem. This can result in stale extent
cache entries when competing with operations such as
ext4_collapse_range() which calls ext4_ext_remove_space() or
ext4_ext_shift_extents().

The problem arises when ext4_ext_remove_space() temporarily releases
i_data_sem due to insufficient journal credits. During this interval, a
concurrent ext4_fiemap() may cache extent entries that are about to be
deleted. As a result, these cached entries become stale and inconsistent
with the actual extents.

Loading the extents cache without holding the inode's i_rwsem or the
mapping's invalidate_lock is not permitted besides during the writeback.
Fix this by holding the i_rwsem in ext4_fiemap().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14 10:42:12 -04:00
Zhang Yi
402e38e6b7 ext4: prevent stale extent cache entries caused by concurrent I/O writeback
Currently, in the I/O writeback path, ext4_map_blocks() may attempt to
cache additional unrelated extents in the extent status tree without
holding the inode's i_rwsem and the mapping's invalidate_lock. This can
lead to stale extent status entries remaining in certain scenarios,
potentially causing data corruption.

For example, when performing a collapse range in ext4_collapse_range(),
it clears the extent cache and dirty pages before removing blocks and
shifting extents. It also holds the i_data_sem during these two
operations. However, both ext4_ext_remove_space() and
ext4_ext_shift_extents() may briefly release the i_data_sem if journal
credits are insufficient (ext4_datasem_ensure_credits()). If another
writeback process writes dirty pages from other regions during this
interval, it may cache extents that are about to be modified. Unless
ext4_collapse_range() explicitly clears the extent cache again, these
cached entries can become stale and inconsistent with the actual
extents.

     0 a  n       b      c         m
     | |  |       |      |         |
    [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
          |                           |
          N                           M

Assume that block a is dirty. The collapse range operation is removing
data from n to m and drops i_data_sem immediately after removing the
extent from b to c. At the same time, a concurrent writeback begins to
write back block a; it will reloads the extent from [n, b) into the
extent status tree since it does not hold the i_rwsem or the
invalidate_lock. After the collapse range operation, it left the stale
extent [n, b), which points logical block n to N, but the actual
physical block of n should be M.

Similarly, both ext4_insert_range() and ext4_truncate() have the same
problem. ext4_punch_hole() survived since it re-add a hole extent entry
after removing space since commit 9f1118223a ("ext4: add a hole extent
entry in cache after punch").

In most cases, during dirty page writeback, the block mapping
information is likely to be found in the extent cache, making it less
necessary to search for physical extents. Consequently, loading
unrelated extent caches during writeback appears to be ineffective.
Therefore, fix this by adds EXT4_EX_NOCACHE in the writeback path to
prevent caching of unrelated extents, eliminating this potential source
of corruption.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14 10:42:12 -04:00
Zhang Yi
86b349ce03 ext4: generalize EXT4_GET_BLOCKS_IO_SUBMIT flag usage
Currently, the EXT4_GET_BLOCKS_IO_SUBMIT flag is only used during data
writeback to indicate that in ordered mode, the journal commit thread
should skip re-submitting data and simply wait for I/O completion.

To prepare for later patches that need to detect I/O submission context
in ext4_map_blocks(), generalizes the meaning of
EXT4_GET_BLOCKS_IO_SUBMIT. This flag will be set during:
 1) data I/O writeback,
 2) I/O completion extents conversion,
 3) journal performing commit in fast_commit.

This change doesn't affect current usage of this flag and provides a
clear way to identify I/O submission context.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14 10:42:12 -04:00
Zhang Yi
53ce42accd ext4: ext4: unify EXT4_EX_NOCACHE|NOFAIL flags in ext4_ext_remove_space()
When removing space, we should use EXT4_EX_NOCACHE because we don't
need to cache extents, and we should also use EXT4_EX_NOFAIL to prevent
metadata inconsistencies that may arise from memory allocation failures.
While ext4_ext_remove_space() already uses these two flags in most
places, they are missing in ext4_ext_search_right() and
read_extent_tree_block() calls. Unify the flags to ensure consistent
behavior throughout the extent removal process.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14 10:42:12 -04:00
Thadeu Lima de Souza Cascardo
227cb4ca5a ext4: inline: fix len overflow in ext4_prepare_inline_data
When running the following code on an ext4 filesystem with inline_data
feature enabled, it will lead to the bug below.

        fd = open("file1", O_RDWR | O_CREAT | O_TRUNC, 0666);
        ftruncate(fd, 30);
        pwrite(fd, "a", 1, (1UL << 40) + 5UL);

That happens because write_begin will succeed as when
ext4_generic_write_inline_data calls ext4_prepare_inline_data, pos + len
will be truncated, leading to ext4_prepare_inline_data parameter to be 6
instead of 0x10000000006.

Then, later when write_end is called, we hit:

        BUG_ON(pos + len > EXT4_I(inode)->i_inline_size);

at ext4_write_inline_data.

Fix it by using a loff_t type for the len parameter in
ext4_prepare_inline_data instead of an unsigned int.

[   44.545164] ------------[ cut here ]------------
[   44.545530] kernel BUG at fs/ext4/inline.c:240!
[   44.545834] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[   44.546172] CPU: 3 UID: 0 PID: 343 Comm: test Not tainted 6.15.0-rc2-00003-g9080916f4863 #45 PREEMPT(full)  112853fcebfdb93254270a7959841d2c6aa2c8bb
[   44.546523] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   44.546523] RIP: 0010:ext4_write_inline_data+0xfe/0x100
[   44.546523] Code: 3c 0e 48 83 c7 48 48 89 de 5b 41 5c 41 5d 41 5e 41 5f 5d e9 e4 fa 43 01 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 0f 0b <0f> 0b 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53 48 83 ec 20 49
[   44.546523] RSP: 0018:ffffb342008b79a8 EFLAGS: 00010216
[   44.546523] RAX: 0000000000000001 RBX: ffff9329c579c000 RCX: 0000010000000006
[   44.546523] RDX: 000000000000003c RSI: ffffb342008b79f0 RDI: ffff9329c158e738
[   44.546523] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[   44.546523] R10: 00007ffffffff000 R11: ffffffff9bd0d910 R12: 0000006210000000
[   44.546523] R13: fffffc7e4015e700 R14: 0000010000000005 R15: ffff9329c158e738
[   44.546523] FS:  00007f4299934740(0000) GS:ffff932a60179000(0000) knlGS:0000000000000000
[   44.546523] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   44.546523] CR2: 00007f4299a1ec90 CR3: 0000000002886002 CR4: 0000000000770eb0
[   44.546523] PKRU: 55555554
[   44.546523] Call Trace:
[   44.546523]  <TASK>
[   44.546523]  ext4_write_inline_data_end+0x126/0x2d0
[   44.546523]  generic_perform_write+0x17e/0x270
[   44.546523]  ext4_buffered_write_iter+0xc8/0x170
[   44.546523]  vfs_write+0x2be/0x3e0
[   44.546523]  __x64_sys_pwrite64+0x6d/0xc0
[   44.546523]  do_syscall_64+0x6a/0xf0
[   44.546523]  ? __wake_up+0x89/0xb0
[   44.546523]  ? xas_find+0x72/0x1c0
[   44.546523]  ? next_uptodate_folio+0x317/0x330
[   44.546523]  ? set_pte_range+0x1a6/0x270
[   44.546523]  ? filemap_map_pages+0x6ee/0x840
[   44.546523]  ? ext4_setattr+0x2fa/0x750
[   44.546523]  ? do_pte_missing+0x128/0xf70
[   44.546523]  ? security_inode_post_setattr+0x3e/0xd0
[   44.546523]  ? ___pte_offset_map+0x19/0x100
[   44.546523]  ? handle_mm_fault+0x721/0xa10
[   44.546523]  ? do_user_addr_fault+0x197/0x730
[   44.546523]  ? do_syscall_64+0x76/0xf0
[   44.546523]  ? arch_exit_to_user_mode_prepare+0x1e/0x60
[   44.546523]  ? irqentry_exit_to_user_mode+0x79/0x90
[   44.546523]  entry_SYSCALL_64_after_hwframe+0x55/0x5d
[   44.546523] RIP: 0033:0x7f42999c6687
[   44.546523] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
[   44.546523] RSP: 002b:00007ffeae4a7930 EFLAGS: 00000202 ORIG_RAX: 0000000000000012
[   44.546523] RAX: ffffffffffffffda RBX: 00007f4299934740 RCX: 00007f42999c6687
[   44.546523] RDX: 0000000000000001 RSI: 000055ea6149200f RDI: 0000000000000003
[   44.546523] RBP: 00007ffeae4a79a0 R08: 0000000000000000 R09: 0000000000000000
[   44.546523] R10: 0000010000000005 R11: 0000000000000202 R12: 0000000000000000
[   44.546523] R13: 00007ffeae4a7ac8 R14: 00007f4299b86000 R15: 000055ea61493dd8
[   44.546523]  </TASK>
[   44.546523] Modules linked in:
[   44.568501] ---[ end trace 0000000000000000 ]---
[   44.568889] RIP: 0010:ext4_write_inline_data+0xfe/0x100
[   44.569328] Code: 3c 0e 48 83 c7 48 48 89 de 5b 41 5c 41 5d 41 5e 41 5f 5d e9 e4 fa 43 01 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 0f 0b <0f> 0b 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53 48 83 ec 20 49
[   44.570931] RSP: 0018:ffffb342008b79a8 EFLAGS: 00010216
[   44.571356] RAX: 0000000000000001 RBX: ffff9329c579c000 RCX: 0000010000000006
[   44.571959] RDX: 000000000000003c RSI: ffffb342008b79f0 RDI: ffff9329c158e738
[   44.572571] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[   44.573148] R10: 00007ffffffff000 R11: ffffffff9bd0d910 R12: 0000006210000000
[   44.573748] R13: fffffc7e4015e700 R14: 0000010000000005 R15: ffff9329c158e738
[   44.574335] FS:  00007f4299934740(0000) GS:ffff932a60179000(0000) knlGS:0000000000000000
[   44.575027] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   44.575520] CR2: 00007f4299a1ec90 CR3: 0000000002886002 CR4: 0000000000770eb0
[   44.576112] PKRU: 55555554
[   44.576338] Kernel panic - not syncing: Fatal exception
[   44.576517] Kernel Offset: 0x1a600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Reported-by: syzbot+fe2a25dae02a207717a0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fe2a25dae02a207717a0
Fixes: f19d5870cb ("ext4: add normal write support for inline data")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://patch.msgid.link/20250415-ext4-prepare-inline-overflow-v1-1-f4c13d900967@igalia.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-13 23:02:09 -04:00
Harshad Shirwadkar
6593714d67 ext4: hold s_fc_lock while during fast commit
Leaving s_fc_lock in between during commit in ext4_fc_perform_commit()
function leaves room for subtle concurrency bugs where ext4_fc_del() may
delete an inode from the fast commit list, leaving list in an inconsistent
state.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250508175908.1004880-10-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-08 21:56:17 -04:00
Harshad Shirwadkar
12e64e7f85 ext4: convert s_fc_lock to mutex type
This allows us to hold s_fc_lock during kmem_cache_* functions, which
is needed in the following patch.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250508175908.1004880-9-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-08 21:56:17 -04:00
Harshad Shirwadkar
86e07d4b9b ext4: temporarily elevate commit thread priority
Unlike JBD2 based full commits, there is no dedicated journal thread
for fast commits. Thus to reduce scheduling delays between IO
submission and completion, temporarily elevate the committer thread's
priority to match the configured priority of the JBD2 journal
thread.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250508175908.1004880-8-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-08 21:56:17 -04:00
Harshad Shirwadkar
69f35ca189 ext4: update code documentation
This patch updates code documentation to reflect the commit path changes
made in this series.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>

code docs

Link: https://patch.msgid.link/20250508175908.1004880-7-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-08 21:56:17 -04:00
Harshad Shirwadkar
ed45d33113 ext4: drop i_fc_updates from inode fc info
The new logic introduced in this series does not require tracking number
of active handles open on an inode. So, drop it.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250508175908.1004880-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-08 21:56:17 -04:00
Harshad Shirwadkar
857d32f261 ext4: rework fast commit commit path
This patch reworks fast commit's commit path to remove locking the
journal for the entire duration of a fast commit. Instead, we only lock
the journal while marking all the eligible inodes as "committing". This
allows handles to make progress in parallel with the fast commit.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://patch.msgid.link/20250508175908.1004880-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-08 21:56:17 -04:00
Harshad Shirwadkar
0b64fd74dd ext4: mark inode dirty before grabbing i_data_sem in ext4_setattr
Mark inode dirty first and then grab i_data_sem in ext4_setattr().

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://patch.msgid.link/20250508175908.1004880-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-08 21:56:17 -04:00
Harshad Shirwadkar
4d3266463e ext4: for committing inode, make ext4_fc_track_inode wait
If the inode that's being requested to track using ext4_fc_track_inode
is being committed, then wait until the inode finishes the
commit. Also, add calls to ext4_fc_track_inode at the right places.
With this patch, now calling ext4_reserve_inode_write() results in
inode being tracked for next fast commit. This ensures that by the
time ext4_reserve_inode_write() returns, it is ready to be modified
and won't be committed until the corresponding handle is open.

A subtle lock ordering requirement with i_data_sem (which is
documented in the code) requires that ext4_fc_track_inode() be called
before grabbing i_data_sem. So, this patch also adds explicit
ext4_fc_track_inode() calls in places where i_data_sem grabbed.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://patch.msgid.link/20250508175908.1004880-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-08 21:56:17 -04:00