| .. SPDX-License-Identifier: GPL-2.0
 | |
| 
 | |
| ========================
 | |
| ext4 General Information
 | |
| ========================
 | |
| 
 | |
| Ext4 is an advanced level of the ext3 filesystem which incorporates
 | |
| scalability and reliability enhancements for supporting large filesystems
 | |
| (64 bit) in keeping with increasing disk capacities and state-of-the-art
 | |
| feature requirements.
 | |
| 
 | |
| Mailing list:	linux-ext4@vger.kernel.org
 | |
| Web site:	http://ext4.wiki.kernel.org
 | |
| 
 | |
| 
 | |
| Quick usage instructions
 | |
| ========================
 | |
| 
 | |
| Note: More extensive information for getting started with ext4 can be
 | |
| found at the ext4 wiki site at the URL:
 | |
| http://ext4.wiki.kernel.org/index.php/Ext4_Howto
 | |
| 
 | |
|   - The latest version of e2fsprogs can be found at:
 | |
| 
 | |
|     https://www.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
 | |
| 
 | |
| 	or
 | |
| 
 | |
|     http://sourceforge.net/project/showfiles.php?group_id=2406
 | |
| 
 | |
| 	or grab the latest git repository from:
 | |
| 
 | |
|    https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
 | |
| 
 | |
|   - Create a new filesystem using the ext4 filesystem type:
 | |
| 
 | |
|         # mke2fs -t ext4 /dev/hda1
 | |
| 
 | |
|     Or to configure an existing ext3 filesystem to support extents:
 | |
| 
 | |
| 	# tune2fs -O extents /dev/hda1
 | |
| 
 | |
|     If the filesystem was created with 128 byte inodes, it can be
 | |
|     converted to use 256 byte for greater efficiency via:
 | |
| 
 | |
|         # tune2fs -I 256 /dev/hda1
 | |
| 
 | |
|   - Mounting:
 | |
| 
 | |
| 	# mount -t ext4 /dev/hda1 /wherever
 | |
| 
 | |
|   - When comparing performance with other filesystems, it's always
 | |
|     important to try multiple workloads; very often a subtle change in a
 | |
|     workload parameter can completely change the ranking of which
 | |
|     filesystems do well compared to others.  When comparing versus ext3,
 | |
|     note that ext4 enables write barriers by default, while ext3 does
 | |
|     not enable write barriers by default.  So it is useful to
 | |
|     explicitly specify whether barriers are enabled or not via the
 | |
|     '-o barrier=[0|1]' mount option for both ext3 and ext4 filesystems
 | |
|     for a fair comparison.  When tuning ext3 for best benchmark numbers,
 | |
|     it is often worthwhile to try changing the data journaling mode; '-o
 | |
|     data=writeback' can be faster for some workloads.  (Note however that
 | |
|     running mounted with data=writeback can potentially leave stale data
 | |
|     exposed in recently written files in case of an unclean shutdown,
 | |
|     which could be a security exposure in some situations.)  Configuring
 | |
|     the filesystem with a large journal can also be helpful for
 | |
|     metadata-intensive workloads.
 | |
| 
 | |
| Features
 | |
| ========
 | |
| 
 | |
| Currently Available
 | |
| -------------------
 | |
| 
 | |
| * ability to use filesystems > 16TB (e2fsprogs support not available yet)
 | |
| * extent format reduces metadata overhead (RAM, IO for access, transactions)
 | |
| * extent format more robust in face of on-disk corruption due to magics,
 | |
|   internal redundancy in tree
 | |
| * improved file allocation (multi-block alloc)
 | |
| * lift 32000 subdirectory limit imposed by i_links_count[1]
 | |
| * nsec timestamps for mtime, atime, ctime, create time
 | |
| * inode version field on disk (NFSv4, Lustre)
 | |
| * reduced e2fsck time via uninit_bg feature
 | |
| * journal checksumming for robustness, performance
 | |
| * persistent file preallocation (e.g. for streaming media, databases)
 | |
| * ability to pack bitmaps and inode tables into larger virtual groups via the
 | |
|   flex_bg feature
 | |
| * large file support
 | |
| * inode allocation using large virtual block groups via flex_bg
 | |
| * delayed allocation
 | |
| * large block (up to pagesize) support
 | |
| * efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
 | |
|   the ordering)
 | |
| * Case-insensitive file name lookups
 | |
| * file-based encryption support (fscrypt)
 | |
| * file-based verity support (fsverity)
 | |
| 
 | |
| [1] Filesystems with a block size of 1k may see a limit imposed by the
 | |
| directory hash tree having a maximum depth of two.
 | |
| 
 | |
| Case-insensitive file name lookups
 | |
| ==================================
 | |
| 
 | |
| The case-insensitive file name lookup feature is supported on a
 | |
| per-directory basis, allowing the user to mix case-insensitive and
 | |
| case-sensitive directories in the same filesystem.  It is enabled by
 | |
| flipping the +F inode attribute of an empty directory.  The
 | |
| case-insensitive string match operation is only defined when we know how
 | |
| text is encoded in a byte sequence.  For that reason, in order to enable
 | |
| case-insensitive directories, the filesystem must have the
 | |
| casefold feature, which stores the filesystem-wide encoding
 | |
| model used.  By default, the charset adopted is the latest version of
 | |
| Unicode (12.1.0, by the time of this writing), encoded in the UTF-8
 | |
| form.  The comparison algorithm is implemented by normalizing the
 | |
| strings to the Canonical decomposition form, as defined by Unicode,
 | |
| followed by a byte per byte comparison.
 | |
| 
 | |
| The case-awareness is name-preserving on the disk, meaning that the file
 | |
| name provided by userspace is a byte-per-byte match to what is actually
 | |
| written to the disk.  The Unicode normalization format used by the
 | |
| kernel is thus an internal representation, and not exposed to the
 | |
| userspace nor to the disk, with the important exception of disk hashes,
 | |
| used on large case-insensitive directories with DX feature.  On DX
 | |
| directories, the hash must be calculated using the casefolded version of
 | |
| the filename, meaning that the normalization format used actually has an
 | |
| impact on where the directory entry is stored.
 | |
| 
 | |
| When we change from viewing filenames as opaque byte sequences to seeing
 | |
| them as encoded strings we need to address what happens when a program
 | |
| tries to create a file with an invalid name.  The Unicode subsystem
 | |
| within the kernel leaves the decision of what to do in this case to the
 | |
| filesystem, which selects its preferred behavior by enabling/disabling
 | |
| the strict mode.  When Ext4 encounters one of those strings and the
 | |
| filesystem did not require strict mode, it falls back to considering the
 | |
| entire string as an opaque byte sequence, which still allows the user to
 | |
| operate on that file, but the case-insensitive lookups won't work.
 | |
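The comparison and the non-strict fallback described above can be sketched in userspace as follows. This is only an approximation for illustration: the kernel uses its own Unicode tables, while this sketch relies on Python's ``unicodedata`` module (whose Unicode version may differ), and the helper names ``fold_name`` and ``names_match`` are hypothetical.

```python
import unicodedata


def fold_name(name: bytes) -> bytes:
    """Return a comparison key for a file name given as raw bytes.

    Valid UTF-8 names are case-folded and normalized to the canonical
    decomposition form (NFD).  Anything else is returned unchanged, i.e.
    it is treated as an opaque byte sequence (non-strict behaviour).
    """
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return name  # invalid encoding: no case-insensitive matching
    folded = unicodedata.normalize("NFD", text).casefold()
    # Normalize again, since case folding can produce composed characters.
    return unicodedata.normalize("NFD", folded).encode("utf-8")


def names_match(a: bytes, b: bytes) -> bool:
    """Byte-per-byte comparison of the folded, normalized names."""
    return fold_name(a) == fold_name(b)
```

With this model, a precomposed and a decomposed spelling of the same name compare equal regardless of case, while names that are not valid UTF-8 only match themselves exactly.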
| 
 | |
| Options
 | |
| =======
 | |
| 
 | |
| When mounting an ext4 filesystem, the following options are accepted:
 | |
| (*) == default
 | |
| 
 | |
|   ro
 | |
|         Mount filesystem read only. Note that ext4 will replay the journal (and
 | |
|         thus write to the partition) even when mounted "read only". The mount
 | |
|         options "ro,noload" can be used to prevent writes to the filesystem.
 | |
| 
 | |
|   journal_checksum
 | |
|         Enable checksumming of the journal transactions.  This will allow the
 | |
|         recovery code in e2fsck and the kernel to detect corruption in the
 | |
|         journal.  It is a compatible change and will be ignored by older
 | |
|         kernels.
 | |
| 
 | |
|   journal_async_commit
 | |
|         Commit block can be written to disk without waiting for descriptor
 | |
|         blocks. If enabled, older kernels cannot mount the device. This will
 | |
|         enable 'journal_checksum' internally.
 | |
| 
 | |
|   journal_path=path, journal_dev=devnum
 | |
|         When the external journal device's major/minor numbers have changed,
 | |
|         these options allow the user to specify the new journal location.  The
 | |
|         journal device is identified through either its new major/minor numbers
 | |
|         encoded in devnum, or via a path to the device.
 | |
| 
 | |
|   norecovery, noload
 | |
|         Don't load the journal on mounting.  Note that if the filesystem was
 | |
|         not unmounted cleanly, skipping the journal replay will lead to the
 | |
|         filesystem containing inconsistencies that can lead to any number of
 | |
|         problems.
 | |
| 
 | |
|   data=journal
 | |
|         All data are committed into the journal prior to being written into the
 | |
|         main file system.  Enabling this mode will disable delayed allocation
 | |
|         and O_DIRECT support.
 | |
| 
 | |
|   data=ordered	(*)
 | |
|         All data are forced directly out to the main file system prior to its
 | |
|         metadata being committed to the journal.
 | |
| 
 | |
|   data=writeback
 | |
|         Data ordering is not preserved, data may be written into the main file
 | |
|         system after its metadata has been committed to the journal.
 | |
| 
 | |
|   commit=nrsec	(*)
 | |
|         This setting limits the maximum age of the running transaction to
 | |
|         'nrsec' seconds.  The default value is 5 seconds.  This means that if
 | |
|         you lose your power, you will lose as much as the latest 5 seconds of
 | |
|         metadata changes (your filesystem will not be damaged though, thanks
 | |
|         to the journaling). This default value (or any low value) will hurt
 | |
|         performance, but it's good for data-safety.  Setting it to 0 will have
 | |
|         the same effect as leaving it at the default (5 seconds).  Setting it
 | |
|         to very large values will improve performance.  Note that due to
 | |
|         delayed allocation even older data can be lost on power failure since
 | |
|         writeback of those data begins only after time set in
 | |
|         /proc/sys/vm/dirty_expire_centisecs.
 | |
| 
 | |
|   barrier=<0|1(*)>, barrier(*), nobarrier
 | |
|         This enables/disables the use of write barriers in the jbd code.
 | |
|         barrier=0 disables, barrier=1 enables.  This also requires an IO stack
 | |
|         which can support barriers, and if jbd gets an error on a barrier
 | |
|         write, it will disable again with a warning.  Write barriers enforce
 | |
|         proper on-disk ordering of journal commits, making volatile disk write
 | |
|         caches safe to use, at some performance penalty.  If your disks are
 | |
|         battery-backed in one way or another, disabling barriers may safely
 | |
|         improve performance.  The mount options "barrier" and "nobarrier" can
 | |
|         also be used to enable or disable barriers, for consistency with other
 | |
|         ext4 mount options.
 | |
| 
 | |
|   inode_readahead_blks=n
 | |
|         This tuning parameter controls the maximum number of inode table blocks
 | |
|         that ext4's inode table readahead algorithm will pre-read into the
 | |
|         buffer cache.  The default value is 32 blocks.
 | |
| 
 | |
|   nouser_xattr
 | |
|         Disables Extended User Attributes.  See the attr(5) manual page for
 | |
|         more information about extended attributes.
 | |
| 
 | |
|   noacl
 | |
|         This option disables POSIX Access Control List support. If ACL support
 | |
|         is enabled in the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL
 | |
|         is enabled by default on mount. See the acl(5) manual page for more
 | |
|         information about acl.
 | |
| 
 | |
|   bsddf	(*)
 | |
|         Make 'df' act like BSD.
 | |
| 
 | |
|   minixdf
 | |
|         Make 'df' act like Minix.
 | |
| 
 | |
|   debug
 | |
|         Extra debugging information is sent to syslog.
 | |
| 
 | |
|   abort
 | |
|         Simulate the effects of calling ext4_abort() for debugging purposes.
 | |
|         This is normally used while remounting a filesystem which is already
 | |
|         mounted.
 | |
| 
 | |
|   errors=remount-ro
 | |
|         Remount the filesystem read-only on an error.
 | |
| 
 | |
|   errors=continue
 | |
|         Keep going on a filesystem error.
 | |
| 
 | |
|   errors=panic
 | |
|         Panic and halt the machine if an error occurs.  (These mount options
 | |
|         override the errors behavior specified in the superblock, which can be
 | |
|         configured using tune2fs)
 | |
| 
 | |
|   data_err=ignore(*)
 | |
|         Just print an error message if an error occurs in a file data buffer in
 | |
|         ordered mode.
 | |
|   data_err=abort
 | |
|         Abort the journal if an error occurs in a file data buffer in ordered
 | |
|         mode.
 | |
| 
 | |
|   grpid | bsdgroups
 | |
|         New objects have the group ID of their parent.
 | |
| 
 | |
|   nogrpid (*) | sysvgroups
 | |
|         New objects have the group ID of their creator.
 | |
| 
 | |
|   resgid=n
 | |
|         The group ID which may use the reserved blocks.
 | |
| 
 | |
|   resuid=n
 | |
|         The user ID which may use the reserved blocks.
 | |
| 
 | |
|   sb=
 | |
|         Use alternate superblock at this location.
 | |
| 
 | |
|   quota, noquota, grpquota, usrquota
 | |
|         These options are ignored by the filesystem. They are used only by
 | |
|         quota tools to recognize volumes where quota should be turned on. See
 | |
|         documentation in the quota-tools package for more details
 | |
|         (http://sourceforge.net/projects/linuxquota).
 | |
| 
 | |
|   jqfmt=<quota type>, usrjquota=<file>, grpjquota=<file>
 | |
|         These options tell filesystem details about quota so that quota
 | |
|         information can be properly updated during journal replay. They replace
 | |
|         the above quota options. See documentation in the quota-tools package
 | |
|         for more details (http://sourceforge.net/projects/linuxquota).
 | |
| 
 | |
|   stripe=n
 | |
|         Number of filesystem blocks that mballoc will try to use for allocation
 | |
|         size and alignment. For RAID5/6 systems this should be the number of
 | |
|         data disks *  RAID chunk size in file system blocks.
 | |
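As a worked example of the stripe arithmetic above, the following sketch computes a value for ``stripe=`` from the RAID geometry. The function name and its parameters are assumptions made for illustration; only the formula (data disks times chunk size in filesystem blocks) comes from the text.

```python
def stripe_blocks(data_disks: int, chunk_kib: int, block_size: int = 4096) -> int:
    """Return data_disks * (RAID chunk size expressed in filesystem blocks).

    data_disks excludes parity disks, chunk_kib is the RAID chunk size in
    KiB, and block_size is the filesystem block size in bytes.
    """
    chunk_bytes = chunk_kib * 1024
    if chunk_bytes % block_size:
        raise ValueError("chunk size is not a multiple of the block size")
    return data_disks * (chunk_bytes // block_size)
```

For instance, a four-disk RAID5 array has three data disks; with a 64 KiB chunk and 4 KiB blocks, ``stripe_blocks(3, 64)`` gives 48.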
| 
 | |
|   delalloc	(*)
 | |
|         Defer block allocation until just before ext4 writes out the block(s)
 | |
|         in question.  This allows ext4 to make better allocation decisions more
 | |
|         efficiently.
 | |
| 
 | |
|   nodelalloc
 | |
|         Disable delayed allocation.  Blocks are allocated when the data is
 | |
|         copied from userspace to the page cache, either via the write(2) system
 | |
|         call or when an mmap'ed page which was previously unallocated is
 | |
|         written for the first time.
 | |
| 
 | |
|   max_batch_time=usec
 | |
|         Maximum amount of time ext4 should wait for additional filesystem
 | |
|         operations to be batched together with a synchronous write operation.
 | |
|         A synchronous write operation forces a commit and then waits for the
 | |
|         I/O to complete, so the extra wait costs little and can be a huge
 | |
|         throughput win; ext4 therefore waits briefly to see if any other
 | |
|         transactions can piggyback on the synchronous write.  The algorithm
 | |
|         used is designed to automatically tune for the speed of the disk, by
 | |
|         measuring the amount of time (on average) that it takes to finish
 | |
|         committing a transaction.  Call this time the "commit time".  If the
 | |
|         time that the transaction has been running is less than the commit
 | |
|         time, ext4 will try sleeping for the commit time to see if other
 | |
|         operations will join the transaction.   The commit time is capped by
 | |
|         the max_batch_time, which defaults to 15000us (15ms).   This
 | |
|         optimization can be turned off entirely by setting max_batch_time to 0.
 | |
| 
 | |
|   min_batch_time=usec
 | |
|         This parameter sets the commit time (as described above) to be at least
 | |
|         min_batch_time.  It defaults to zero microseconds.  Increasing this
 | |
|         parameter may improve the throughput of multi-threaded, synchronous
 | |
|         workloads on very fast disks, at the cost of increasing latency.
 | |
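The interaction between the measured commit time, min_batch_time and max_batch_time can be modelled as below. This is a sketch of the behaviour as described in the text, not the kernel code; the function name and argument names are hypothetical.

```python
def batch_wait_usec(avg_commit_usec: int,
                    running_usec: int,
                    min_batch_usec: int = 0,
                    max_batch_usec: int = 15000) -> int:
    """Return how long (in microseconds) to wait for other operations.

    avg_commit_usec is the measured average commit time and running_usec
    is how long the current transaction has been running.
    """
    # The commit time is at least min_batch_time and capped by max_batch_time.
    commit_time = max(avg_commit_usec, min_batch_usec)
    commit_time = min(commit_time, max_batch_usec)
    # Only wait if the transaction is younger than the commit time.
    if running_usec < commit_time:
        return commit_time
    return 0
```

Setting ``max_batch_usec`` to 0 makes the function always return 0, which corresponds to turning the optimization off.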
| 
 | |
|   journal_ioprio=prio
 | |
|         The I/O priority (from 0 to 7, where 0 is the highest priority) which
 | |
|         should be used for I/O operations submitted by kjournald2 during a
 | |
|         commit operation.  This defaults to 3, which is a slightly higher
 | |
|         priority than the default I/O priority.
 | |
| 
 | |
|   auto_da_alloc(*), noauto_da_alloc
 | |
|         Many broken applications don't use fsync() when replacing existing
 | |
|         files via patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/
 | |
|         rename("foo.new", "foo"), or worse yet, fd = open("foo",
 | |
|         O_TRUNC)/write(fd,..)/close(fd).  If auto_da_alloc is enabled, ext4
 | |
|         will detect the replace-via-rename and replace-via-truncate patterns
 | |
|         and force that any delayed allocation blocks are allocated such that at
 | |
|         the next journal commit, in the default data=ordered mode, the data
 | |
|         blocks of the new file are forced to disk before the rename() operation
 | |
|         is committed.  This provides roughly the same level of guarantees as
 | |
|         ext3, and avoids the "zero-length" problem that can happen when a
 | |
|         system crashes before the delayed allocation blocks are forced to disk.
 | |
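For contrast with the broken patterns above, an application that does not want to rely on auto_da_alloc can call fsync() before the rename. A minimal sketch, assuming the temporary name is simply the target name with a ``.new`` suffix (the function name is hypothetical):

```python
import os


def replace_file_durably(path: str, data: bytes) -> None:
    """Replace the file at path using write, fsync, then rename."""
    tmp_path = path + ".new"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # force the data blocks to disk
    os.rename(tmp_path, path)  # atomically replace the old file
    # Also sync the directory so that the rename itself is persistent.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

After a crash, a reader then sees either the complete old contents or the complete new contents, never a zero-length file.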
| 
 | |
|   noinit_itable
 | |
|         Do not initialize any uninitialized inode table blocks in the
 | |
|         background.  This feature may be used by installation CDs so that the
 | |
|         install process can complete as quickly as possible; the inode table
 | |
|         initialization process would then be deferred until the next time the
 | |
|         file system is unmounted.
 | |
| 
 | |
|   init_itable=n
 | |
|         The lazy itable init code will wait n times the number of milliseconds
 | |
|         it took to zero out the previous block group's inode table.  This
 | |
|         minimizes the impact on the system performance while file system's
 | |
|         inode table is being initialized.
 | |
| 
 | |
|   discard, nodiscard(*)
 | |
|         Controls whether ext4 should issue discard/TRIM commands to the
 | |
|         underlying block device when blocks are freed.  This is useful for SSD
 | |
|         devices and sparse/thinly-provisioned LUNs, but it is off by default
 | |
|         until sufficient testing has been done.
 | |
| 
 | |
|   nouid32
 | |
|         Disables 32-bit UIDs and GIDs.  This is for interoperability  with
 | |
|         older kernels which only store and expect 16-bit values.
 | |
| 
 | |
|   block_validity(*), noblock_validity
 | |
|         These options enable or disable the in-kernel facility for tracking
 | |
|         filesystem metadata blocks within internal data structures.  This
 | |
|         allows the multi-block allocator and other routines to notice bugs or
 | |
|         corrupted allocation bitmaps which cause blocks to be allocated which
 | |
|         overlap with filesystem metadata blocks.
 | |
| 
 | |
|   dioread_lock, dioread_nolock
 | |
|         Controls whether or not ext4 should use DIO read locking. If the
 | |
|         dioread_nolock option is specified, ext4 will allocate an uninitialized
 | |
|         extent before a buffered write and convert the extent to initialized
 | |
|         after IO completes. This approach allows ext4 code to avoid using the
 | |
|         inode mutex, which improves scalability on high-speed storage. However,
 | |
|         this does not work with data journaling, in which case the
 | |
|         dioread_nolock option will be ignored with a kernel warning. Note that
 | |
|         the dioread_nolock code path is only used for extent-based files.
 | |
|         Because of these restrictions, it is off by default (i.e. dioread_lock).
 | |
| 
 | |
|   max_dir_size_kb=n
 | |
|         This limits the size of directories so that any attempt to expand them
 | |
|         beyond the specified limit in kilobytes will cause an ENOSPC error.
 | |
|         This is useful in memory constrained environments, where a very large
 | |
|         directory can cause severe performance problems or even provoke the Out
 | |
|         Of Memory killer.  (For example, if there is only 512mb memory
 | |
|         available, a 176mb directory may seriously cramp the system's style.)
 | |
| 
 | |
|   i_version
 | |
|         Enable 64-bit inode version support. This option is off by default.
 | |
| 
 | |
|   dax
 | |
|         Use direct access (no page cache).  See
 | |
|         Documentation/filesystems/dax.txt.  Note that this option is
 | |
|         incompatible with data=journal.
 | |
| 
 | |
|   inlinecrypt
 | |
|         When possible, encrypt/decrypt the contents of encrypted files using the
 | |
|         blk-crypto framework rather than filesystem-layer encryption. This
 | |
|         allows the use of inline encryption hardware. The on-disk format is
 | |
|         unaffected. For more details, see
 | |
|         Documentation/block/inline-encryption.rst.
 | |
| 
 | |
| Data Mode
 | |
| =========
 | |
| There are 3 different data modes:
 | |
| 
 | |
| * writeback mode
 | |
| 
 | |
|   In data=writeback mode, ext4 does not journal data at all.  This mode provides
 | |
|   a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
 | |
|   mode - metadata journaling.  A crash+recovery can cause incorrect data to
 | |
|   appear in files which were written shortly before the crash.  This mode will
 | |
|   typically provide the best ext4 performance.
 | |
| 
 | |
| * ordered mode
 | |
| 
 | |
|   In data=ordered mode, ext4 only officially journals metadata, but it logically
 | |
|   groups metadata information related to data changes with the data blocks into
 | |
|   a single unit called a transaction.  When it's time to write the new metadata
 | |
|   out to disk, the associated data blocks are written first.  In general, this
 | |
|   mode performs slightly slower than writeback but significantly faster than
 | |
|   journal mode.
 | |
| 
 | |
| * journal mode
 | |
| 
 | |
|   data=journal mode provides full data and metadata journaling.  All new data is
 | |
|   written to the journal first, and then to its final location.  In the event of
 | |
|   a crash, the journal can be replayed, bringing both data and metadata into a
 | |
|   consistent state.  This mode is the slowest except when data needs to be read
 | |
|   from and written to disk at the same time where it outperforms all other
 | |
|   modes.  Enabling this mode will disable delayed allocation and O_DIRECT
 | |
|   support.
 | |
| 
 | |
| /proc entries
 | |
| =============
 | |
| 
 | |
| Information about mounted ext4 file systems can be found in
 | |
| /proc/fs/ext4.  Each mounted filesystem will have a directory in
 | |
| /proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
 | |
| /proc/fs/ext4/dm-0).   The files in each per-device directory are shown
 | |
| in the table below.
 | |
| 
 | |
| Files in /proc/fs/ext4/<devname>
 | |
| 
 | |
|   mb_groups
 | |
|         details of multiblock allocator buddy cache of free blocks
 | |
| 
 | |
| /sys entries
 | |
| ============
 | |
| 
 | |
| Information about mounted ext4 file systems can be found in
 | |
| /sys/fs/ext4.  Each mounted filesystem will have a directory in
 | |
| /sys/fs/ext4 based on its device name (e.g., /sys/fs/ext4/hdc or
 | |
| /sys/fs/ext4/dm-0).   The files in each per-device directory are shown
 | |
| in the table below.
 | |
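A per-device directory can be read with ordinary file I/O. The sketch below collects every readable regular file into a dictionary; the function name and the ``root`` parameter (which allows pointing it at another directory) are assumptions made for illustration.

```python
import os


def read_ext4_attrs(devname: str, root: str = "/sys/fs/ext4") -> dict:
    """Return {file name: contents} for /sys/fs/ext4/<devname>."""
    attrs = {}
    dev_dir = os.path.join(root, devname)
    for name in sorted(os.listdir(dev_dir)):
        path = os.path.join(dev_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        try:
            with open(path) as f:
                attrs[name] = f.read().strip()
        except OSError:
            continue  # skip files that cannot be read
    return attrs
```

On a system with a mounted ext4 filesystem on dm-0, ``read_ext4_attrs("dm-0")`` would return the current values of that device's files as strings.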
| 
 | |
| Files in /sys/fs/ext4/<devname>:
 | |
| 
 | |
| (see also Documentation/ABI/testing/sysfs-fs-ext4)
 | |
| 
 | |
|   delayed_allocation_blocks
 | |
|         This file is read-only and shows the number of blocks that are dirty in
 | |
|         the page cache, but which do not have their location in the filesystem
 | |
|         allocated yet.
 | |
| 
 | |
|   inode_goal
 | |
|         Tuning parameter which (if non-zero) controls the goal inode used by
 | |
|         the inode allocator in preference to all other allocation heuristics.
 | |
|         This is intended for debugging use only, and should be 0 on production
 | |
|         systems.
 | |
| 
 | |
|   inode_readahead_blks
 | |
|         Tuning parameter which controls the maximum number of inode table
 | |
|         blocks that ext4's inode table readahead algorithm will pre-read into
 | |
|         the buffer cache.
 | |
| 
 | |
|   lifetime_write_kbytes
 | |
|         This file is read-only and shows the number of kilobytes of data that
 | |
|         have been written to this filesystem since it was created.
 | |
| 
 | |
|   max_writeback_mb_bump
 | |
|         The maximum number of megabytes the writeback code will try to write
 | |
|         out before moving on to another inode.
 | |
| 
 | |
|   mb_group_prealloc
 | |
|         The multiblock allocator will round up allocation requests to a
 | |
|         multiple of this tuning parameter if the stripe size is not set in the
 | |
|         ext4 superblock.
 | |
| 
 | |
|   mb_max_inode_prealloc
 | |
|         The maximum length of per-inode ext4_prealloc_space list.
 | |
| 
 | |
|   mb_max_to_scan
 | |
|         The maximum number of extents the multiblock allocator will search to
 | |
|         find the best extent.
 | |
| 
 | |
|   mb_min_to_scan
 | |
|         The minimum number of extents the multiblock allocator will search to
 | |
|         find the best extent.
 | |
| 
 | |
|   mb_order2_req
 | |
|         Tuning parameter which controls the minimum size for requests (as a
 | |
|         power of 2) where the buddy cache is used.
 | |
| 
 | |
|   mb_stats
 | |
|         Controls whether the multiblock allocator should collect statistics,
 | |
|         which are shown during the unmount. 1 means to collect statistics, 0
 | |
|         means not to collect statistics.
 | |
| 
 | |
|   mb_stream_req
 | |
|         Files which have fewer blocks than this tunable parameter will have
 | |
|         their blocks allocated out of a block group specific preallocation
 | |
|         pool, so that small files are packed closely together.  Each large file
 | |
|         will have its blocks allocated out of its own unique preallocation
 | |
|         pool.
 | |
| 
 | |
|   session_write_kbytes
 | |
|         This file is read-only and shows the number of kilobytes of data that
 | |
|         have been written to this filesystem since it was mounted.
 | |
| 
 | |
|   reserved_clusters
 | |
|         This is a read-write file containing the number of reserved clusters
 | |
|         in the file system, which are used in specific situations to avoid
 | |
|         costly zeroout, unexpected ENOSPC, or possible data loss. The default
 | |
|         is 2% or 4096 clusters, whichever is smaller. This can be changed, but
 | |
|         it can never exceed the number of clusters in the file system. If there
 | |
|         is not enough space for the reserved clusters when mounting the file
 | |
|         system, the mount will _not_ fail.
 | |
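The default for reserved_clusters stated above ("2% or 4096 clusters, whichever is smaller") works out as follows. The function name is hypothetical, and rounding down to a whole cluster is an assumption of this sketch.

```python
def default_reserved_clusters(total_clusters: int) -> int:
    """Return the smaller of 2% of the clusters and 4096 clusters."""
    two_percent = total_clusters // 50  # 2% == 1/50, rounded down
    return min(two_percent, 4096)
```

The 4096-cluster cap therefore applies to any filesystem with at least 204800 clusters; smaller filesystems reserve 2%.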
| 
 | |
| Ioctls
 | |
| ======
 | |
| 
 | |
| Ext4 implements various ioctls which can be used by applications to access
 | |
| ext4-specific functionality. An incomplete list of these ioctls is shown in the
 | |
| table below. This list includes truly ext4-specific ioctls (``EXT4_IOC_*``) as
 | |
| well as ioctls that may have been ext4-specific originally but are now supported
 | |
| by some other filesystem(s) too (``FS_IOC_*``).
 | |
| 
 | |
| Table of Ext4 ioctls
 | |
| 
 | |
|   FS_IOC_GETFLAGS
 | |
|         Get additional attributes associated with inode.  The ioctl argument is
 | |
|         an integer bitfield, with bit values described in ext4.h.
 | |
| 
 | |
|   FS_IOC_SETFLAGS
 | |
|         Set additional attributes associated with inode.  The ioctl argument is
 | |
|         an integer bitfield, with bit values described in ext4.h.
 | |
| 
 | |
|   EXT4_IOC_GETVERSION, EXT4_IOC_GETVERSION_OLD
 | |
|         Get the inode i_generation number stored for each inode. The
 | |
|         i_generation number is normally changed only when a new inode is created
 | |
|         and it is particularly useful for network filesystems. The '_OLD'
 | |
|         version of this ioctl is an alias for FS_IOC_GETVERSION.
 | |
| 
 | |
|   EXT4_IOC_SETVERSION, EXT4_IOC_SETVERSION_OLD
 | |
|         Set the inode i_generation number stored for each inode. The '_OLD'
 | |
|         version of this ioctl is an alias for FS_IOC_SETVERSION.
 | |
| 
 | |
|   EXT4_IOC_GROUP_EXTEND
 | |
|         This ioctl has the same purpose as the resize mount option. It allows
 | |
|         resizing the filesystem up to the end of the last existing block group;
 | |
|         further resizing has to be done with resize2fs, either online or
 | |
|         offline. The argument points to an unsigned long number representing
 | |
|         the filesystem's new block count.
 | |
| 
 | |
|   EXT4_IOC_MOVE_EXT
 | |
|         Move the block extents from orig_fd (the one this ioctl is pointing to)
 | |
|         to the donor_fd (the one specified in move_extent structure passed as
 | |
|         an argument to this ioctl). Then, exchange inode metadata between
 | |
|         orig_fd and donor_fd.  This is especially useful for online
 | |
|         defragmentation, because the allocator has the opportunity to allocate
 | |
|         moved blocks better, ideally into one contiguous extent.
 | |
| 
 | |
|   EXT4_IOC_GROUP_ADD
 | |
|         Add a new group descriptor to an existing or new group descriptor
 | |
|         block. The new group descriptor is described by ext4_new_group_input
 | |
|         structure, which is passed as an argument to this ioctl. This is
 | |
|         especially useful in conjunction with EXT4_IOC_GROUP_EXTEND, which
 | |
|         allows online resize of the filesystem to the end of the last existing
 | |
|         block group.  Those two ioctls combined are used in the userspace online
 | |
|         resize tool (e.g. resize2fs).
 | |
| 
 | |
|   EXT4_IOC_MIGRATE
 | |
|         This ioctl operates on the filesystem itself.  It converts (migrates)
 | |
|         ext3 indirect block mapped inode to ext4 extent mapped inode by walking
 | |
|         through indirect block mapping of the original inode and converting
 | |
|         contiguous block ranges into ext4 extents of the temporary inode. Then,
 | |
|         inodes are swapped. This ioctl might help when migrating from an ext3 to
 | |
|         an ext4 filesystem; however, the suggestion is to create a fresh ext4
 | |
|         filesystem and copy data from the backup. Note that the filesystem has
 | |
|         to support extents for this ioctl to work.
 | |
| 
 | |
|   EXT4_IOC_ALLOC_DA_BLKS
 | |
|         Force all of the delay allocated blocks to be allocated to preserve
 | |
|         application-expected ext3 behaviour. Note that this will also start
 | |
|         triggering a write of the data blocks, but this behaviour may change in
 | |
|         the future as it is not necessary and has been done this way only for
 | |
|         the sake of simplicity.
 | |
| 
 | |
|   EXT4_IOC_RESIZE_FS
 | |
|         Resize the filesystem to a new size.  The number of blocks of the resized
 | |
|         filesystem is passed in via a 64-bit integer argument.  The kernel
 | |
|         allocates bitmaps and inode table, the userspace tool thus just passes
 | |
|         the new number of blocks.
 | |
| 
 | |
|   EXT4_IOC_SWAP_BOOT
 | |
|         Swap i_blocks and associated attributes (like i_blocks, i_size,
 | |
|         i_flags, ...) from the specified inode with inode EXT4_BOOT_LOADER_INO
 | |
|         (#5). This is typically used to store a boot loader in a secure part of
 | |
|         the filesystem, where it can't be changed by a normal user by accident.
 | |
|         The data blocks of the previous boot loader will be associated with the
 | |
|         given inode.
 | |
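As an illustration of FS_IOC_GETFLAGS from the table above, the sketch below reads the flag bitfield of a file and decodes a few bits. The flag values are a subset of the ``FS_*_FL`` constants in ``<linux/fs.h>``; the helper names are hypothetical, and the request number is built with the generic ioctl encoding used on x86 and arm, which is an assumption that does not hold on every architecture.

```python
import fcntl
import struct

# Subset of the FS_*_FL inode flag bits from <linux/fs.h>.
FLAG_NAMES = {
    0x00000010: "immutable",
    0x00000020: "append",
    0x00000080: "noatime",
    0x00000800: "encrypt",
    0x00001000: "index",
    0x00080000: "extents",
    0x00100000: "verity",
    0x40000000: "casefold",
}


def decode_flags(bits: int) -> list:
    """Return the names of the known flags set in bits, lowest bit first."""
    return [name for bit, name in sorted(FLAG_NAMES.items()) if bits & bit]


def get_flags(path: str) -> int:
    """Issue FS_IOC_GETFLAGS on path and return the flag bitfield."""
    # FS_IOC_GETFLAGS is _IOR('f', 1, long); generic encoding assumed:
    # direction << 30 | size << 16 | type << 8 | number.
    size = struct.calcsize("l")
    request = (2 << 30) | (size << 16) | (ord("f") << 8) | 1
    buf = bytearray(size)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), request, buf)
    # The kernel stores the flags as a 32-bit integer at the start of buf.
    return struct.unpack_from("i", buf)[0] & 0xFFFFFFFF
```

``decode_flags(get_flags(path))`` then lists the known attributes of a file; the ioctl raises ``OSError`` on filesystems that do not support it.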
| 
 | |
| References
 | |
| ==========
 | |
| 
 | |
| kernel source:	<file:fs/ext4/>
 | |
| 		<file:fs/jbd2/>
 | |
| 
 | |
| programs:	http://e2fsprogs.sourceforge.net/
 | |
| 
 | |
| useful links:	https://fedoraproject.org/wiki/ext3-devel
 | |
| 		http://www.bullopensource.org/ext4/
 | |
| 		http://ext4.wiki.kernel.org/index.php/Main_Page
 | |
| 		https://fedoraproject.org/wiki/Features/Ext4
 |