2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

988 Commits

Author SHA1 Message Date
Linus Torvalds
89fd915c40 libnvdimm for 4.14
* Media error handling support in the Block Translation Table (BTT)
   driver is reworked to address sleeping-while-atomic locking and
   memory-allocation-context conflicts.
 
 * The dax_device lookup overhead for xfs and ext4 is moved out of the
   iomap hot-path to a mount-time lookup.
 
 * A new 'ecc_unit_size' sysfs attribute is added to advertise the
   read-modify-write boundary property of a persistent memory range.
 
 * Preparatory fix-ups for arm and powerpc pmem support are included
   along with other miscellaneous fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZtsAGAAoJEB7SkWpmfYgCrzMP/2vPvZvrFjZn5pAoZjlmTmHM
 ySceoOC7vwvVXIsSs52FhSjcxEoXo9cklXPwhXOPVtVUFdSDJBUOIUxwIziE6Y+5
 sFJ2xT9K+5zKBUiXJwqFQDg52dn//eBNnnnDz+HQrBSzGrbWQhIZY2m19omPzv1I
 BeN0OCGOdW3cjSo3BCFl1d+KrSl704e7paeKq/TO3GIiAilIXleTVxcefEEodV2K
 ZvWHpFIhHeyN8dsF8teI952KcCT92CT/IaabxQIwCxX0/8/GFeDc5aqf77qiYWKi
 uxCeQXdgnaE8EZNWZWGWIWul6eYEkoCNbLeUQ7eJnECq61VxVajJS0NyGa5T9OiM
 P046Bo2b1b3R0IHxVIyVG0ZCm3YUMAHSn/3uRxPgESJ4bS/VQ3YP5M6MLxDOlc90
 IisLilagitkK6h8/fVuVrwciRNQ71XEC34t6k7GCl/1ZnLlLT+i4/jc5NRZnGEZh
 aXAAGHdteQ+/mSz6p2UISFUekbd6LerwzKRw8ibDvH6pTud8orYR7g2+JoGhgb6Y
 pyFVE8DhIcqNKAMxBsjiRZ46OQ7qrT+AemdAG3aVv6FaNoe4o5jPLdw2cEtLqtpk
 +DNm0/lSWxxxozjrvu6EUZj6hk8R5E19XpRzV5QJkcKUXMu7oSrFLdMcC4FeIjl9
 K4hXLV3fVBVRMiS0RA6z
 =5iGY
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm from Dan Williams:
 "A rework of media error handling in the BTT driver and other updates.
  It has appeared in a few -next releases and collected some late-
  breaking build-error and warning fixups as a result.

  Summary:

   - Media error handling support in the Block Translation Table (BTT)
     driver is reworked to address sleeping-while-atomic locking and
     memory-allocation-context conflicts.

   - The dax_device lookup overhead for xfs and ext4 is moved out of the
     iomap hot-path to a mount-time lookup.

   - A new 'ecc_unit_size' sysfs attribute is added to advertise the
     read-modify-write boundary property of a persistent memory range.

   - Preparatory fix-ups for arm and powerpc pmem support are included
     along with other miscellaneous fixes"

* tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
  libnvdimm, btt: fix format string warnings
  libnvdimm, btt: clean up warning and error messages
  ext4: fix null pointer dereference on sbi
  libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
  dax: fix FS_DAX=n BLOCK=y compilation
  libnvdimm: fix integer overflow static analysis warning
  libnvdimm, nd_blk: remove mmio_flush_range()
  libnvdimm, btt: rework error clearing
  libnvdimm: fix potential deadlock while clearing errors
  libnvdimm, btt: cache sector_size in arena_info
  libnvdimm, btt: ensure that flags were also unchanged during a map_read
  libnvdimm, btt: refactor map entry operations with macros
  libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
  libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
  ext4: perform dax_device lookup at mount
  ext2: perform dax_device lookup at mount
  xfs: perform dax_device lookup at mount
  dax: introduce a fs_dax_get_by_bdev() helper
  libnvdimm, btt: check memory allocation failure
  libnvdimm, label: fix index block size calculation
  ...
2017-09-11 13:10:57 -07:00
Linus Torvalds
ae8ac6b7db Merge branch 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota scaling updates from Jan Kara:
 "This contains changes to make the quota subsystem more scalable.

  Reportedly it improves number of files created per second on ext4
  filesystem on fast storage by about a factor of 2x"

* 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
  quota: Add lock annotations to struct members
  quota: Reduce contention on dq_data_lock
  fs: Provide __inode_get_bytes()
  quota: Inline dquot_[re]claim_reserved_space() into callsite
  quota: Inline inode_{incr,decr}_space() into callsites
  quota: Inline functions into their callsites
  ext4: Disable dirty list tracking of dquots when journalling quotas
  quota: Allow disabling tracking of dirty dquots in a list
  quota: Remove dq_wait_unused from dquot
  quota: Move locking into clear_dquot_dirty()
  quota: Do not dirty bad dquots
  quota: Fix possible corruption of dqi_flags
  quota: Propagate ->quota_read errors from v2_read_file_info()
  quota: Fix error codes in v2_read_file_info()
  quota: Push dqio_sem down to ->read_file_info()
  quota: Push dqio_sem down to ->write_file_info()
  quota: Push dqio_sem down to ->get_next_id()
  quota: Push dqio_sem down to ->release_dqblk()
  quota: Remove locking for writing to the old quota format
  quota: Do not acquire dqio_sem for dquot overwrites in v2 format
  ...
2017-09-07 15:19:35 -07:00
Colin Ian King
aed9eb1b21 ext4: fix null pointer dereference on sbi
In the case of a kzalloc failure when allocating sbi we end up
with a null pointer dereference on sbi when assigning sbi->s_daxdev.
Fix this by moving the assignment of sbi->s_daxdev to after the
null pointer check of sbi.

Detected by CoverityScan CID#1455379 ("Dereference before null check")

Fixes: 5e405595e5 ("ext4: perform dax_device lookup at mount")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-09-05 10:02:08 -07:00
Dan Williams
5e405595e5 ext4: perform dax_device lookup at mount
The ->iomap_begin() operation is a hot path, so cache the
fs_dax_get_by_host() result at mount time to avoid the incurring the
hash lookup overhead on a per-i/o basis.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-31 11:12:13 -07:00
zhangyi (F)
95f1fda47c ext4: fix quota inconsistency during orphan cleanup for read-only mounts
Quota does not get enabled for read-only mounts if filesystem
has quota feature, so that quotas cannot updated during orphan
cleanup, which will lead to quota inconsistency.

This patch turn on quotas during orphan cleanup for this case,
make sure quotas can be updated correctly.

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org # 3.18+
2017-08-24 15:21:50 -04:00
zhangyi (F)
b0a5a9589d ext4: fix incorrect quotaoff if the quota feature is enabled
Current ext4 quota should always "usage enabled" if the
quota feautre is enabled. But in ext4_orphan_cleanup(), it
turn quotas off directly (used for the older journaled
quota), so we cannot turn it on again via "quotaon" unless
umount and remount ext4.

Simple reproduce:

  mkfs.ext4 -O project,quota /dev/vdb1
  mount -o prjquota /dev/vdb1 /mnt
  chattr -p 123 /mnt
  chattr +P /mnt
  touch /mnt/aa /mnt/bb
  exec 100<>/mnt/aa
  rm -f /mnt/aa
  sync
  echo c > /proc/sysrq-trigger

  #reboot and mount
  mount -o prjquota /dev/vdb1 /mnt
  #query status
  quotaon -Ppv /dev/vdb1
  #output
  quotaon: Cannot find mountpoint for device /dev/vdb1
  quotaon: No correct mountpoint specified.

This patch add check for journaled quotas to avoid incorrect
quotaoff when ext4 has quota feautre.

Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org # 3.18
2017-08-24 15:19:39 -04:00
Jan Kara
7b9ca4c61b quota: Reduce contention on dq_data_lock
dq_data_lock is currently used to protect all modifications of quota
accounting information, consistency of quota accounting on the inode,
and dquot pointers from inode. As a result contention on the lock can be
pretty heavy.

Reduce the contention on the lock by protecting quota accounting
information by a new dquot->dq_dqb_lock and consistency of quota
accounting with inode usage by inode->i_lock.

This change reduces time to create 500000 files on ext4 on ramdisk by 50
different processes in separate directories by 6% when user quota is
turned on. When those 50 processes belong to 50 different users, the
improvement is about 9%.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:07:59 +02:00
Jan Kara
91389240a2 ext4: Disable dirty list tracking of dquots when journalling quotas
When journalling quotas, we writeback all dquots immediately after
changing them as part of current transation. Thus there's no need to
write anything in dquot_writeback_dquots() and so we can avoid updating
list of dirty dquots to reduce dq_list_lock contention.

This change reduces time to create 500000 files on ext4 on ramdisk by 50
different processes in separate directories by 15% when user quota is
turned on.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 22:00:54 +02:00
Jan Kara
bc8230ee8e quota: Convert dqio_mutex to rwsem
Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes
yet.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 18:52:48 +02:00
Eric Whitney
a627b0a7c1 ext4: remove unused metadata accounting variables
Two variables in ext4_inode_info, i_reserved_meta_blocks and
i_allocated_meta_blocks, are unused.  Removing them saves a little
memory per in-memory inode and cleans up clutter in several tracepoints.
Adjust tracepoint output from ext4_alloc_da_blocks() for consistency
and fix a typo and whitespace near these changes.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-07-30 22:30:11 -04:00
David Howells
bc98a42c1f VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
Firstly by applying the following with coccinelle's spatch:

	@@ expression SB; @@
	-SB->s_flags & MS_RDONLY
	+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

	@@ expression A, SB; @@
	(
	-(!sb_rdonly(SB)) && A
	+!sb_rdonly(SB) && A
	|
	-A != (sb_rdonly(SB))
	+A != sb_rdonly(SB)
	|
	-A == (sb_rdonly(SB))
	+A == sb_rdonly(SB)
	|
	-!(sb_rdonly(SB))
	+!sb_rdonly(SB)
	|
	-A && (sb_rdonly(SB))
	+A && sb_rdonly(SB)
	|
	-A || (sb_rdonly(SB))
	+A || sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) != A
	+sb_rdonly(SB) != A
	|
	-(sb_rdonly(SB)) == A
	+sb_rdonly(SB) == A
	|
	-(sb_rdonly(SB)) && A
	+sb_rdonly(SB) && A
	|
	-(sb_rdonly(SB)) || A
	+sb_rdonly(SB) || A
	)

	@@ expression A, B, SB; @@
	(
	-(sb_rdonly(SB)) ? 1 : 0
	+sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) ? A : B
	+sb_rdonly(SB) ? A : B
	)

to remove left over excess bracketage and finally by applying:

	@@ expression A, SB; @@
	(
	-(A & MS_RDONLY) != sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
	|
	-(A & MS_RDONLY) == sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
	)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-17 08:45:34 +01:00
Linus Torvalds
bc2c6421cb The first major feature for ext4 this merge window is the largedir
feature, which allows ext4 directories to support over 2 billion
 directory entries (assuming ~64 byte file names; in practice, users
 will run into practical performance limits first.)  This feature was
 originally written by the Lustre team, and credit goes to Artem
 Blagodarenko from Seagate for getting this feature upstream.
 
 The second major major feature allows ext4 to support extended
 attribute values up to 64k.  This feature was also originally from
 Lustre, and has been enhanced by Tahsin Erdogan from Google with a
 deduplication feature so that if multiple files have the same xattr
 value (for example, Windows ACL's stored by Samba), only one copy will
 be stored on disk for encoding and caching efficiency.
 
 We also have the usual set of bug fixes, cleanups, and optimizations.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllhl5AACgkQ8vlZVpUN
 gaOiNQf+L23sT9KIQmFwQP38vkBVw67Eo7gBfevmk7oqQLiRppT5mmLzW8EWEDxR
 PVaDQXvSZi18wSCAAcCd1ZqeIZk0P6tst0ufnIT60tGlZdUlwSLyrqvV/30axR2g
 6kcnv90ZszrQNx5U8q8bMzNrs1KtyPHFCRzavFsBX11WezNSpWnH2in/uxO+t9Jy
 F2zlrLUrE2m9AVMH48Dh6LbeaB6pqgr4k3jq1jG4Iqb2h9xgU8OKhs8gL07YS+Qi
 5A7s8GIvYQSoZUO9DOOie2f1zhpO0KrhXchyZTJukVQH7TsmFxoSh0vhXnP1Bohu
 CNLV6dzetDT0VfmPr1WhVe7lhZeeVw==
 =FFkF
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The first major feature for ext4 this merge window is the largedir
  feature, which allows ext4 directories to support over 2 billion
  directory entries (assuming ~64 byte file names; in practice, users
  will run into practical performance limits first.) This feature was
  originally written by the Lustre team, and credit goes to Artem
  Blagodarenko from Seagate for getting this feature upstream.

  The second major major feature allows ext4 to support extended
  attribute values up to 64k. This feature was also originally from
  Lustre, and has been enhanced by Tahsin Erdogan from Google with a
  deduplication feature so that if multiple files have the same xattr
  value (for example, Windows ACL's stored by Samba), only one copy will
  be stored on disk for encoding and caching efficiency.

  We also have the usual set of bug fixes, cleanups, and optimizations"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
  ext4: fix spelling mistake: "prellocated" -> "preallocated"
  ext4: fix __ext4_new_inode() journal credits calculation
  ext4: skip ext4_init_security() and encryption on ea_inodes
  fs: generic_block_bmap(): initialize all of the fields in the temp bh
  ext4: change fast symlink test to not rely on i_blocks
  ext4: require key for truncate(2) of encrypted file
  ext4: don't bother checking for encryption key in ->mmap()
  ext4: check return value of kstrtoull correctly in reserved_clusters_store
  ext4: fix off-by-one fsmap error on 1k block filesystems
  ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
  ext4: return EIO on read error in ext4_find_entry
  ext4: forbid encrypting root directory
  ext4: send parallel discards on commit completions
  ext4: avoid unnecessary stalls in ext4_evict_inode()
  ext4: add nombcache mount option
  ext4: strong binding of xattr inode references
  ext4: eliminate xattr entry e_hash recalculation for removes
  ext4: reserve space for xattr entries/names
  quota: add get_inode_usage callback to transfer multi-inode charges
  ext4: xattr inode deduplication
  ...
2017-07-09 09:31:22 -07:00
Linus Torvalds
58f587cb0b Add support for 128-bit AES and some cleanups to fscrypt
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllhktgACgkQ8vlZVpUN
 gaOQIQf+KM2s46sxxEl0/hjdBXR4OxTmSS2/0900NPyg7JHKlL8PdYslOyvMiKjo
 wEi+YPwwQgbHtxhI1VINfV/q12MZHwvmFOfD9NzjrISwfmfsKj0dBgZDAfBH82sK
 12wKgUxA8xJ4P+Xdvnz2PokRcFCsh1YUr5IUQkP3JR2RZOxNFUj42QwPJ2yWzqxO
 MsnepMjIHsxvXZi0E7sPjRaoFsh3DDeLmNl8sX6INodC7hxJ1LotYKqJhA4stQpB
 ezXY2tabwg3gaOWvWH7THyHhGntbZVDga3iRrKdNLahXN8OBdHktmG75ubiN6tEg
 x80pqQLgr41yIQuJVOuyeh5jLYZrww==
 =i4r9
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Add support for 128-bit AES and some cleanups to fscrypt"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: make ->dummy_context() return bool
  fscrypt: add support for AES-128-CBC
  fscrypt: inline fscrypt_free_filename()
2017-07-09 09:03:31 -07:00
Tahsin Erdogan
af65207c76 ext4: fix __ext4_new_inode() journal credits calculation
ea_inode feature allows creating extended attributes that are up to
64k in size. Update __ext4_new_inode() to pick increased credit limits.

To avoid overallocating too many journal credits, update
__ext4_xattr_set_credits() to make a distinction between xattr create
vs update. This helps __ext4_new_inode() because all attributes are
known to be new, so we can save credits that are normally needed to
delete old values.

Also, have fscrypt specify its maximum context size so that we don't
end up allocating credits for 64k size.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-07-06 00:01:59 -04:00
Linus Torvalds
81e3e04489 UUID/GUID updates:
- introduce the new uuid_t/guid_t types that are going to replace
    the somewhat confusing uuid_be/uuid_le types and make the terminology
    fit the various specs, as well as the userspace libuuid library.
    (me, based on a previous version from Amir)
  - consolidated generic uuid/guid helper functions lifted from XFS
    and libnvdimm (Amir and me)
  - conversions to the new types and helpers (Amir, Andy and me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAllZfmILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMvyg/9EvWHOOsSdeDykCK3KdH2uIqnxwpl+m7ljccaGJIc
 MmaH0KnsP9p/Cuw5hESh2tYlmCYN7pmYziNXpf/LRS65/HpEYbs4oMqo8UQsN0UM
 2IXHfXY0HnCoG5OixH8RNbFTkxuGphsTY8meaiDr6aAmqChDQI2yGgQLo3WM2/Qe
 R9N1KoBWH/bqY6dHv+urlFwtsREm2fBH+8ovVma3TO73uZCzJGLJBWy3anmZN+08
 uYfdbLSyRN0T8rqemVdzsZ2SrpHYkIsYGUZV43F581vp8e/3OKMoMxpWRRd9fEsa
 MXmoaHcLJoBsyVSFR9lcx3axKrhAgBPZljASbbA0h49JneWXrzghnKBQZG2SnEdA
 ktHQ2sE4Yb5TZSvvWEKMQa3kXhEfIbTwgvbHpcDr5BUZX8WvEw2Zq8e7+Mi4+KJw
 QkvFC1S96tRYO2bxdJX638uSesGUhSidb+hJ/edaOCB/GK+sLhUdDTJgwDpUGmyA
 xVXTF51ramRS2vhlbzN79x9g33igIoNnG4/PV0FPvpCTSqxkHmPc5mK6Vals1lqt
 cW6XfUjSQECq5nmTBtYDTbA/T+8HhBgSQnrrvmferjJzZUFGr/7MXl+Evz2x4CjX
 OBQoAMu241w6Vp3zoXqxzv+muZ/NLar52M/zbi9TUjE0GvvRNkHvgCC4NmpIlWYJ
 Sxg=
 =J/4P
 -----END PGP SIGNATURE-----

Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid

Pull uuid subsystem from Christoph Hellwig:
 "This is the new uuid subsystem, in which Amir, Andy and I have started
  consolidating our uuid/guid helpers and improving the types used for
  them. Note that various other subsystems have pulled in this tree, so
  I'd like it to go in early.

  UUID/GUID summary:

   - introduce the new uuid_t/guid_t types that are going to replace the
     somewhat confusing uuid_be/uuid_le types and make the terminology
     fit the various specs, as well as the userspace libuuid library.
     (me, based on a previous version from Amir)

   - consolidated generic uuid/guid helper functions lifted from XFS and
     libnvdimm (Amir and me)

   - conversions to the new types and helpers (Amir, Andy and me)"

* tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
  ACPI: hns_dsaf_acpi_dsm_guid can be static
  mmc: sdhci-pci: make guid intel_dsm_guid static
  uuid: Take const on input of uuid_is_null() and guid_is_null()
  thermal: int340x_thermal: fix compile after the UUID API switch
  thermal: int340x_thermal: Switch to use new generic UUID API
  acpi: always include uuid.h
  ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
  ACPI / extlog: Switch to use new generic UUID API
  ACPI / bus: Switch to use new generic UUID API
  ACPI / APEI: Switch to use new generic UUID API
  acpi, nfit: Switch to use new generic UUID API
  MAINTAINERS: add uuid entry
  tmpfs: generate random sb->s_uuid
  scsi_debug: switch to uuid_t
  nvme: switch to uuid_t
  sysctl: switch to use uuid_t
  partitions/ldm: switch to use uuid_t
  overlayfs: use uuid_t instead of uuid_be
  fs: switch ->s_uuid to uuid_t
  ima/policy: switch to use uuid_t
  ...
2017-07-03 09:55:26 -07:00
Eric Biggers
c250b7dd8e fscrypt: make ->dummy_context() return bool
This makes it consistent with ->is_encrypted(), ->empty_dir(), and
fscrypt_dummy_context_enabled().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-23 20:11:50 -04:00
Eric Biggers
9ce0151a47 ext4: forbid encrypting root directory
Currently it's possible to encrypt all files and directories on an ext4
filesystem by deleting everything, including lost+found, then setting an
encryption policy on the root directory.  However, this is incompatible
with e2fsck because e2fsck expects to find, create, and/or write to
lost+found and does not have access to any encryption keys.  Especially
problematic is that if e2fsck can't find lost+found, it will create it
without regard for whether the root directory is encrypted.  This is
wrong for obvious reasons, and it causes a later run of e2fsck to
consider the lost+found directory entry to be corrupted.

Encrypting the root directory may also be of limited use because it is
the "all-or-nothing" use case, for which dm-crypt can be used instead.
(By design, encryption policies are inherited and cannot be overridden;
so the root directory having an encryption policy implies that all files
and directories on the filesystem have that same encryption policy.)

In any case, encrypting the root directory is broken currently and must
not be allowed; so start returning an error if userspace requests it.
For now only do this in ext4, because f2fs and ubifs do not appear to
have the lost+found requirement.  We could move it into
fscrypt_ioctl_set_policy() later if desired, though.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2017-06-23 00:10:36 -04:00
Daeho Jeong
a015434480 ext4: send parallel discards on commit completions
Now, when we mount ext4 filesystem with '-o discard' option, we have to
issue all the discard commands for the blocks to be deallocated and
wait for the completion of the commands on the commit complete phase.
Because this procedure might involve a lot of sequential combinations of
issuing discard commands and waiting for that, the delay of this
procedure might be too much long, even to 17.0s in our test,
and it results in long commit delay and fsync() performance degradation.

To reduce this kind of delay, instead of adding callback for each
extent and handling all of them in a sequential manner on commit phase,
we instead add a separate list of extents to free to the superblock and
then process this list at once after transaction commits so that
we can issue all the discard commands in a parallel manner like XFS
filesystem.

Finally, we could enhance the discard command handling performance.
The result was such that 17.0s delay of a single commit in the worst
case has been enhanced to 4.8s.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Tested-by: Kitae Lee <kitae87.lee@samsung.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-06-22 23:54:33 -04:00
Tahsin Erdogan
cdb7ee4c63 ext4: add nombcache mount option
The main purpose of mb cache is to achieve deduplication in
extended attributes. In use cases where opportunity for deduplication
is unlikely, it only adds overhead.

Add a mount option to explicitly turn off mb cache.

Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:55:14 -04:00
Tahsin Erdogan
7a9ca53aea quota: add get_inode_usage callback to transfer multi-inode charges
Ext4 ea_inode feature allows storing xattr values in external inodes to
be able to store values that are bigger than a block in size. Ext4 also
has deduplication support for these type of inodes. With deduplication,
the actual storage waste is eliminated but the users of such inodes are
still charged full quota for the inodes as if there was no sharing
happening in the background.

This design requires ext4 to manually charge the users because the
inodes are shared.

An implication of this is that, if someone calls chown on a file that
has such references we need to transfer the quota for the file and xattr
inodes. Current dquot_transfer() function implicitly transfers one inode
charge. With ea_inode feature, we would like to transfer multiple inode
charges.

Add get_inode_usage callback which can interrogate the total number of
inodes that were charged for a given inode.

[ Applied fix from Colin King to make sure the 'ret' variable is
  initialized on the successful return path.  Detected by
  CoverityScan, CID#1446616 ("Uninitialized scalar variable") --tytso]

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jan Kara <jack@suse.cz>
2017-06-22 11:46:48 -04:00
Tahsin Erdogan
dec214d00e ext4: xattr inode deduplication
Ext4 now supports xattr values that are up to 64k in size (vfs limit).
Large xattr values are stored in external inodes each one holding a
single value. Once written the data blocks of these inodes are immutable.

The real world use cases are expected to have a lot of value duplication
such as inherited acls etc. To reduce data duplication on disk, this patch
implements a deduplicator that allows sharing of xattr inodes.

The deduplication is based on an in-memory hash lookup that is a best
effort sharing scheme. When a xattr inode is read from disk (i.e.
getxattr() call), its crc32c hash is added to a hash table. Before
creating a new xattr inode for a value being set, the hash table is
checked to see if an existing inode holds an identical value. If such an
inode is found, the ref count on that inode is incremented. On value
removal the ref count is decremented and if it reaches zero the inode is
deleted.

The quota charging for such inodes is manually managed. Every reference
holder is charged the full size as if there was no sharing happening.
This is consistent with how xattr blocks are also charged.

[ Fixed up journal credits calculation to handle inline data and the
  rare case where an shared xattr block can get freed when two thread
  race on breaking the xattr block sharing. --tytso ]

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:44:55 -04:00
Tahsin Erdogan
47387409ee ext2, ext4: make mb block cache names more explicit
There will be a second mb_cache instance that tracks ea_inodes. Make
existing names more explicit so that it is clear that they refer to
xattr block cache.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:28:55 -04:00
Tahsin Erdogan
c1a5d5f6ab ext4: improve journal credit handling in set xattr paths
Both ext4_set_acl() and ext4_set_context() need to be made aware of
ea_inode feature when it comes to credits calculation.

Also add a sufficient credits check in ext4_xattr_set_handle() right
after xattr write lock is grabbed. Original credits calculation is done
outside the lock so there is a possiblity that the initially calculated
credits are not sufficient anymore.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-21 22:28:40 -04:00
Christoph Hellwig
85787090a2 fs: switch ->s_uuid to uuid_t
For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers.  More to come..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2017-06-05 16:59:12 +02:00
Tahsin Erdogan
b8cb5a545c ext4: fix quota charging for shared xattr blocks
ext4_xattr_block_set() calls dquot_alloc_block() to charge for an xattr
block when new references are made. However if dquot_initialize() hasn't
been called on an inode, request for charging is effectively ignored
because ext4_inode_info->i_dquot is not initialized yet.

Add dquot_initialize() to call paths that lead to ext4_xattr_block_set().

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:24:07 -04:00
Jan Kara
964edf66bf ext4: clear lockdep subtype for quota files on quota off
Quota files have special ranking of i_data_sem lock. We inform lockdep
about it when turning on quotas however when turning quotas off, we
don't clear the lockdep subclass from i_data_sem lock and thus when the
inode gets later reused for a normal file or directory, lockdep gets
confused and complains about possible deadlocks. Fix the problem by
resetting lockdep subclass of i_data_sem on quota off.

Cc: stable@vger.kernel.org
Fixes: daf647d2dd
Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:31:23 -04:00
Linus Torvalds
0fcc3ab23d Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "Incremental fixes and a small feature addition on top of the main
  libnvdimm 4.12 pull request:

   - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
     The size regression is fixed by moving all dax helpers into the
     dax-core and only specifying "select DAX" for FS_DAX and
     dax-capable drivers. He also asked for clarification of the
     NR_DEV_DAX config option which, on closer look, does not need to be
     a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
     for good measure.

   - Ben's attention to detail on -stable patch submissions caught a
     case where the recent fixes to arch_copy_from_iter_pmem() missed a
     condition where we strand dirty data in the cache. This is tagged
     for -stable and will also be included in the rework of the pmem api
     to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

   - Vishal adds a feature that missed the initial pull due to pending
     review feedback. It allows the kernel to clear media errors when
     initializing a BTT (atomic sector update driver) instance on a pmem
     namespace.

   - Ross noticed that the dax_device + dax_operations conversion broke
     __dax_zero_page_range(). The nvdimm unit tests fail to check this
     path, but xfstests immediately trips over it. No excuse for missing
     this before submitting the 4.12 pull request.

  These all pass the nvdimm unit tests and an xfstests spot check. The
  set has received a build success notification from the kbuild robot"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  filesystem-dax: fix broken __dax_zero_page_range() conversion
  libnvdimm, btt: ensure that initializing metadata clears poison
  libnvdimm: add an atomic vs process context flag to rw_bytes
  x86, pmem: Fix cache flushing for iovec write < 8 bytes
  device-dax: kill NR_DEV_DAX
  block, dax: move "select DAX" from BLOCK to FS_DAX
  device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX
2017-05-12 15:43:10 -07:00
Linus Torvalds
bf5f89463f Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - various misc things

 - procfs updates

 - lib/ updates

 - checkpatch updates

 - kdump/kexec updates

 - add kvmalloc helpers, use them

 - time helper updates for Y2038 issues. We're almost ready to remove
   current_fs_time() but that awaits a btrfs merge.

 - add tracepoints to DAX

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
  drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4
  selftests/vm: add a test for virtual address range mapping
  dax: add tracepoint to dax_insert_mapping()
  dax: add tracepoint to dax_writeback_one()
  dax: add tracepoints to dax_writeback_mapping_range()
  dax: add tracepoints to dax_load_hole()
  dax: add tracepoints to dax_pfn_mkwrite()
  dax: add tracepoints to dax_iomap_pte_fault()
  mtd: nand: nandsim: convert to memalloc_noreclaim_*()
  treewide: convert PF_MEMALLOC manipulations to new helpers
  mm: introduce memalloc_noreclaim_{save,restore}
  mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
  mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required
  mm/huge_memory.c: use zap_deposited_table() more
  time: delete CURRENT_TIME_SEC and CURRENT_TIME
  gfs2: replace CURRENT_TIME with current_time
  apparmorfs: replace CURRENT_TIME with current_time()
  lustre: replace CURRENT_TIME macro
  fs: ubifs: replace CURRENT_TIME_SEC with current_time
  fs: ufs: use ktime_get_real_ts64() for birthtime
  ...
2017-05-08 18:17:56 -07:00
Michal Hocko
a7c3e901a4 mm: introduce kv[mz]alloc helpers
Patch series "kvmalloc", v5.

There are many open coded kmalloc with vmalloc fallback instances in the
tree.  Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

This patch (of 9):

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code.  Yet we do not have any common helper
for that and so users have invented their own helpers.  Some of them are
really creative when doing so.  Let's just add kv[mz]alloc and make sure
it is implemented properly.  This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures.  This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them.  In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL).  Those need to be
fixed separately.

While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers.  Existing ones would have to die
slowly.

[sfr@canb.auug.org.au: f2fs fixup]
  Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Linus Torvalds
dd727dad37 Add GETFSMAP support; some performance improvements for very large
file systems and for random write workloads into a preallocated file;
 bug fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlkPYB8ACgkQ8vlZVpUN
 gaP1HwgApoMQGegtRIbCZKUzKBJ2S6vwIoPAMz62JuwngOyWygJ1T1TliKTitG04
 XvijKpUHtEggMO/ZsUOCoyr2LzJlpVvvrJZsavEubO12LKreYMpvNraZF1GACYTb
 lIZpdWkpcEz5WnPV/PXW/dEMcSMhnKe8tbmHXMyAouSC6a55F5Wp456KF/plqkHU
 zkWTCDbEOtHThzpL8cthUL71ji62I3Op5jn/qOfKCm6/JtUlw5pYjWkRUNqqjSQE
 uQqMpqLxI/VjOdEiBPxEF6A+ZudZmoBQKY15ibWCcHUPFOPqk4RdYz6VivRI7zrg
 KrrKcdFT29MtKnRfAAoJcc0nJ4e1Iw==
 =il74
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:

 - add GETFSMAP support

 - some performance improvements for very large file systems and for
   random write workloads into a preallocated file

 - bug fixes and cleanups.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: cleanup write flags handling from jbd2_write_superblock()
  ext4: mark superblock writes synchronous for nobarrier mounts
  ext4: inherit encryption xattr before other xattrs
  ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
  ext4: avoid unnecessary transaction stalls during writeback
  ext4: preload block group descriptors
  ext4: make ext4_shutdown() static
  ext4: support GETFSMAP ioctls
  vfs: add common GETFSMAP ioctl definitions
  ext4: evict inline data when writing to memory map
  ext4: remove ext4_xattr_check_entry()
  ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
  ext4: merge ext4_xattr_list() into ext4_listxattr()
  ext4: constify static data that is never modified
  ext4: trim return value and 'dir' argument from ext4_insert_dentry()
  jbd2: fix dbench4 performance regression for 'nobarrier' mounts
  jbd2: Fix lockdep splat with generic/270 test
  mm: retry writepages() on ENOMEM when doing an data integrity writeback
2017-05-08 11:30:05 -07:00
Dan Williams
ef51042472 block, dax: move "select DAX" from BLOCK to FS_DAX
For configurations that do not enable DAX filesystems or drivers, do not
require the DAX core to be built.

Given that the 'direct_access' method has been removed from
'block_device_operations', we can also go ahead and remove the
block-related dax helper functions from fs/block_dev.c to
drivers/dax/super.c. This keeps dax details out of the block layer and
lets the DAX core be built as a module in the FS_DAX=n case.

Filesystems need to include dax.h to call bdev_dax_supported().

Cc: linux-xfs@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-08 10:55:27 -07:00
Jan Kara
00473374b7 ext4: mark superblock writes synchronous for nobarrier mounts
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_FUA implementation.
generic_make_request_checks() however strips REQ_FUA flag from a bio
when the storage doesn't report volatile write cache and thus write
effectively becomes asynchronous which can lead to performance
regressions. This affects superblock writes for ext4. Fix the problem
by marking superblock writes always as synchronous.

Fixes: b685d3d65a
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-04 10:58:03 -04:00
Linus Torvalds
a3719f34fd Merge branch 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, reiserfs, udf and ext2 updates from Jan Kara:
 "The branch contains changes to quota code so that it does not modify
  persistent flags in inode->i_flags (it was the only place in kernel
  doing that) and handle it inside filesystem's quotaon/off handlers
  instead.

  The branch also contains two UDF cleanups, a couple of reiserfs fixes
  and one fix for ext2 quota locking"

* 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext4: Improve comments in ext4_quota_{on|off}()
  udf: use kmap_atomic for memcpy copying
  udf: use octal for permissions
  quota: Remove dquot_quotactl_ops
  reiserfs: Remove i_attrs_to_sd_attrs()
  reiserfs: Remove useless setting of i_flags
  jfs: Remove jfs_get_inode_flags()
  ext2: Remove ext2_get_inode_flags()
  ext4: Remove ext4_get_inode_flags()
  quota: Stop setting IMMUTABLE and NOATIME flags on quota files
  jfs: Set flags on quota files directly
  ext2: Set flags on quota files directly
  reiserfs: Set flags on quota files directly
  ext4: Set flags on quota files directly
  reiserfs: Protect dquot_writeback_dquots() by s_umount semaphore
  reiserfs: Make cancel_old_flush() reliable
  ext2: Call dquot_writeback_dquots() with s_umount held
  reiserfs: avoid a -Wmaybe-uninitialized warning
2017-05-03 11:35:47 -07:00
Andrew Perepechko
85c8f176a6 ext4: preload block group descriptors
With enabled meta_bg option block group descriptors
reading IO is not sequential and requires optimization.

Signed-off-by: Andrew Perepechko <andrew.perepechko@seagate.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:46:35 -04:00
Darrick J. Wong
0c9ec4beec ext4: support GETFSMAP ioctls
Support the GETFSMAP ioctls so that we can use the xfs free space
management tools to probe ext4 as well.  Note that this is a partial
implementation -- we only report fixed-location metadata and free space;
everything else is reported as "unknown".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:36:53 -04:00
Eric Biggers
d600618673 ext4: constify static data that is never modified
Constify static data in ext4 that is never (intentionally) modified so
that it is placed in .rodata and benefits from memory protection.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:47:50 -04:00
Jan Kara
61a929870d ext4: Improve comments in ext4_quota_{on|off}()
Improve comments in ext4_quota_{on|off}() to explain that returning
success despite ext4_journal_start() failing is deliberate.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-24 16:49:16 +02:00
Jan Kara
957153fce8 ext4: Set flags on quota files directly
Currently immutable and noatime flags on quota files are set by quota
code which requires us to copy inode->i_flags to our on disk version of
quota flags in GETFLAGS ioctl and ext4_do_update_inode(). Move to
setting / clearing these on-disk flags directly to save that copying.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-19 14:21:23 +02:00
Eric Biggers
94840e3c80 fscrypt: eliminate ->prepare_context() operation
The only use of the ->prepare_context() fscrypt operation was to allow
ext4 to evict inline data from the inode before ->set_context().
However, there is no reason why this cannot be done as simply the first
step in ->set_context(), and in fact it makes more sense to do it that
way because then the policy modes and flags get validated before any
real work is done.  Therefore, merge ext4_prepare_context() into
ext4_set_context(), and remove ->prepare_context().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-03-15 14:15:47 -04:00
Linus Torvalds
cab7076a18 For this cycle we add support for the shutdown ioctl, which is
primarily used for testing, but which can be useful on production
 systems when a scratch volume is being destroyed and the data on it
 doesn't need to be saved.  This found (and we fixed) a number of bugs
 with ext4's recovery to corrupted file system --- the bugs increased
 the amount of data that could be potentially lost, and in the case of
 the inline data feature, could cause the kernel to BUG.
 
 Also included are a number of other bug fixes, including in ext4's
 fscrypt, DAX, inline data support.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlirXesACgkQ8vlZVpUN
 gaMOzQf8Ct6uPatV+m855oR4dAbZr2+lY4A4C+vHDzBtSMkPRyLX8cuo8XcwfTIm
 vPVyDnL6EPyhXPxxfItu+92wAq1m5mVpKo57d0Ft5lw0rHxNtJTgVSRzsQ7VDRjj
 5qMHW2K7Bk7EjzTeW3SF8/3+hqpzkAvRtNCntcomk5h08+cWMC8JSnn1kqw+naIn
 EcbrC72GZb8JUELogVXC2vU58lp50SSBdr3l005jqKc5BvljMvdJ0Izn/3RVyU7u
 q7vtynhe2ScFcHe/UzL1QgmQOy32tJpbS0NHalW47aw3Ynmn4cSX0YhhT9FDjRNQ
 VOOfo1m1sAg166x0E+Nn7FeghTSSyA==
 =cPIf
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "For this cycle we add support for the shutdown ioctl, which is
  primarily used for testing, but which can be useful on production
  systems when a scratch volume is being destroyed and the data on it
  doesn't need to be saved.

  This found (and we fixed) a number of bugs with ext4's recovery to
  corrupted file system --- the bugs increased the amount of data that
  could be potentially lost, and in the case of the inline data feature,
  could cause the kernel to BUG.

  Also included are a number of other bug fixes, including in ext4's
  fscrypt, DAX, inline data support"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
  ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN
  ext4: fix fencepost in s_first_meta_bg validation
  ext4: don't BUG when truncating encrypted inodes on the orphan list
  ext4: do not use stripe_width if it is not set
  ext4: fix stripe-unaligned allocations
  dax: assert that i_rwsem is held exclusive for writes
  ext4: fix DAX write locking
  ext4: add EXT4_IOC_GOINGDOWN ioctl
  ext4: add shutdown bit and check for it
  ext4: rename s_resize_flags to s_ext4_flags
  ext4: return EROFS if device is r/o and journal replay is needed
  ext4: preserve the needs_recovery flag when the journal is aborted
  jbd2: don't leak modified metadata buffers on an aborted journal
  ext4: fix inline data error paths
  ext4: move halfmd4 into hash.c directly
  ext4: fix use-after-iput when fscrypt contexts are inconsistent
  jbd2: fix use after free in kjournald2()
  ext4: fix data corruption in data=journal mode
  ext4: trim allocation requests to group size
  ext4: replace BUG_ON with WARN_ON in mb_find_extent()
  ...
2017-02-20 18:24:39 -08:00
Theodore Ts'o
2ba3e6e8af ext4: fix fencepost in s_first_meta_bg validation
It is OK for s_first_meta_bg to be equal to the number of block group
descriptor blocks.  (It rarely happens, but it shouldn't cause any
problems.)

https://bugzilla.kernel.org/show_bug.cgi?id=194567

Fixes: 3a4b77cd47
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2017-02-15 01:26:39 -05:00
Jan Kara
5469d7c308 ext4: do not use stripe_width if it is not set
Avoid using stripe_width for sbi->s_stripe value if it is not actually
set. It prevents using the stride for sbi->s_stripe.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-10 00:56:09 -05:00
Eric Biggers
6f69f0ed61 fscrypt: constify struct fscrypt_operations
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Richard Weinberger <richard@nod.at>
2017-02-08 10:59:57 -05:00
Theodore Ts'o
783d948544 ext4: add EXT4_IOC_GOINGDOWN ioctl
This ioctl is modeled after the xfs's XFS_IOC_GOINGDOWN ioctl.  (In
fact, it uses the same code points.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-05 19:47:14 -05:00
Theodore Ts'o
0db1ff222d ext4: add shutdown bit and check for it
Add a shutdown bit that will cause ext4 processing to fail immediately
with EIO.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-05 01:28:48 -05:00
Theodore Ts'o
4753d8a24d ext4: return EROFS if device is r/o and journal replay is needed
If the file system requires journal recovery, and the device is
read-ony, return EROFS to the mount system call.  This allows xfstests
generic/050 to pass.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2017-02-05 01:26:48 -05:00
Theodore Ts'o
97abd7d4b5 ext4: preserve the needs_recovery flag when the journal is aborted
If the journal is aborted, the needs_recovery feature flag should not
be removed.  Otherwise, it's the journal might not get replayed and
this could lead to more data getting lost.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2017-02-04 23:38:06 -05:00
Theodore Ts'o
670e9875eb ext4: add debug_want_extra_isize mount option
In order to test the inode extra isize expansion code, it is useful to
be able to easily create file systems that have inodes with extra
isize values smaller than the current desired value.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-01-11 15:32:22 -05:00
Eric Biggers
a5d431eff2 fscrypt: make fscrypt_operations.key_prefix a string
There was an unnecessary amount of complexity around requesting the
filesystem-specific key prefix.  It was unclear why; perhaps it was
envisioned that different instances of the same filesystem type could
use different key prefixes, or that key prefixes could be binary.
However, neither of those things were implemented or really make sense
at all.  So simplify the code by making key_prefix a const char *.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-01-08 01:03:41 -05:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds
0110c350c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "In this pile:

   - autofs-namespace series
   - dedupe stuff
   - more struct path constification"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
  ocfs2: charge quota for reflinked blocks
  ocfs2: fix bad pointer cast
  ocfs2: always unlock when completing dio writes
  ocfs2: don't eat io errors during _dio_end_io_write
  ocfs2: budget for extent tree splits when adding refcount flag
  ocfs2: prohibit refcounted swapfiles
  ocfs2: add newlines to some error messages
  ocfs2: convert inode refcount test to a helper
  simple_write_end(): don't zero in short copy into uptodate
  exofs: don't mess with simple_write_{begin,end}
  9p: saner ->write_end() on failing copy into non-uptodate page
  fix gfs2_stuffed_write_end() on short copies
  fix ceph_write_end()
  nfs_write_end(): fix handling of short copies
  vfs: refactor clone/dedupe_file_range common functions
  fs: try to clone files first in vfs_copy_file_range
  vfs: misc struct path constification
  namespace.c: constify struct path passed to a bunch of primitives
  quota: constify struct path in quota_on
  ...
2016-12-17 18:44:00 -08:00
Linus Torvalds
5084fdf081 This merge request includes the dax-4.0-iomap-pmd branch which is
needed for both ext4 and xfs dax changes to use iomap for DAX.  It
 also includes the fscrypt branch which is needed for ubifs encryption
 work as well as ext4 encryption and fscrypt cleanups.
 
 Lots of cleanups and bug fixes, especially making sure ext4 is robust
 against maliciously corrupted file systems --- especially maliciously
 corrupted xattr blocks and a maliciously corrupted superblock.  Also
 fix ext4 support for 64k block sizes so it works well on ppcle.  Fixed
 mbcache so we don't miss some common xattr blocks that can be merged.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlhQQVEACgkQ8vlZVpUN
 gaN9TQgAoCD+V4kJjMCFhiV8u6QR3hqD6bOZbggo5wJf4CHglWkmrbAmc3jANOgH
 CKsXDRRjxuDjPXf1ukB1i4M7ArLYjkbbzKdsu7lismoJLS+w8uwUKSNdep+LYMjD
 alxUcf5DCzLlUmdOdW4yE22L+CwRfqfs8IpBvKmJb7DrAKiwJVA340ys6daBGuu1
 63xYx0QIyPzq0xjqLb6TVf88HUI4NiGVXmlm2wcrnYd5966hEZd/SztOZTVCVWOf
 Z0Z0fGQ1WJzmaBB9+YV3aBi+BObOx4m2PUprIa531+iEW02E+ot5Xd4vVQFoV/r4
 NX3XtoBrT1XlKagy2sJLMBoCavqrKw==
 =j4KP
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "This merge request includes the dax-4.0-iomap-pmd branch which is
  needed for both ext4 and xfs dax changes to use iomap for DAX. It also
  includes the fscrypt branch which is needed for ubifs encryption work
  as well as ext4 encryption and fscrypt cleanups.

  Lots of cleanups and bug fixes, especially making sure ext4 is robust
  against maliciously corrupted file systems --- especially maliciously
  corrupted xattr blocks and a maliciously corrupted superblock. Also
  fix ext4 support for 64k block sizes so it works well on ppcle. Fixed
  mbcache so we don't miss some common xattr blocks that can be merged"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
  dax: Fix sleep in atomic contex in grab_mapping_entry()
  fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
  fscrypt: Delay bounce page pool allocation until needed
  fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
  fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
  fscrypt: Never allocate fscrypt_ctx on in-place encryption
  fscrypt: Use correct index in decrypt path.
  fscrypt: move the policy flags and encryption mode definitions to uapi header
  fscrypt: move non-public structures and constants to fscrypt_private.h
  fscrypt: unexport fscrypt_initialize()
  fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
  fscrypto: move ioctl processing more fully into common code
  fscrypto: remove unneeded Kconfig dependencies
  MAINTAINERS: fscrypto: recommend linux-fsdevel for fscrypto patches
  ext4: do not perform data journaling when data is encrypted
  ext4: return -ENOMEM instead of success
  ext4: reject inodes with negative size
  ext4: remove another test in ext4_alloc_file_blocks()
  Documentation: fix description of ext4's block_validity mount option
  ext4: fix checks for data=ordered and journal_async_commit options
  ...
2016-12-14 09:17:42 -08:00
Linus Torvalds
36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Sergey Karamov
73b92a2a5e ext4: do not perform data journaling when data is encrypted
Currently data journalling is incompatible with encryption: enabling both
at the same time has never been supported by design, and would result in
unpredictable behavior. However, users are not precluded from turning on
both features simultaneously. This change programmatically replaces data
journaling for encrypted regular files with ordered data journaling mode.

Background:
Journaling encrypted data has not been supported because it operates on
buffer heads of the page in the page cache. Namely, when the commit
happens, which could be up to five seconds after caching, the commit
thread uses the buffer heads attached to the page to copy the contents of
the page to the journal. With encryption, it would have been required to
keep the bounce buffer with ciphertext for up to the aforementioned five
seconds, since the page cache can only hold plaintext and could not be
used for journaling. Alternatively, it would be required to setup the
journal to initiate a callback at the commit time to perform deferred
encryption - in this case, not only would the data have to be written
twice, but it would also have to be encrypted twice. This level of
complexity was not justified for a mode that in practice is very rarely
used because of the overhead from the data journalling.

Solution:
If data=journaled has been set as a mount option for a filesystem, or if
journaling is enabled on a regular file, do not perform journaling if the
file is also encrypted, instead fall back to the data=ordered mode for the
file.

Rationale:
The intent is to allow seamless and proper filesystem operation when
journaling and encryption have both been enabled, and have these two
conflicting features gracefully resolved by the filesystem.

Fixes: 4461471107
Signed-off-by: Sergey Karamov <skaramov@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-12-10 17:54:58 -05:00
Al Viro
8c54ca9c68 quota: constify struct path in quota_on
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:03:06 -05:00
Jan Kara
ab04df7818 ext4: fix checks for data=ordered and journal_async_commit options
Combination of data=ordered mode and journal_async_commit mount option
is invalid. However the check in parse_options() fails to detect the
case where we simply end up defaulting to data=ordered mode and we
detect the problem only on remount which triggers hard to understand
failure to remount the filesystem.

Fix the checking of mount options to take into account also the default
mode by moving the check somewhat later in the mount sequence.

Reported-by: Wolfgang Walter <linux@stwm.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-03 16:20:53 -05:00
Eryu Guan
3a4b77cd47 ext4: validate s_first_meta_bg at mount time
Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. And it turns out that kernel crashed when
calculating fs overhead (ext4_calculate_overhead()), this is because
the image has very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the memory in count_overhead() when
setting bitmap buffer, which is PAGE_SIZE.

ext4_calculate_overhead():
  buf = get_zeroed_page(GFP_NOFS);  <=== PAGE_SIZE buffer
  blks = count_overhead(sb, i, buf);

count_overhead():
  for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
          ext4_set_bit(EXT4_B2C(sbi, s++), buf);   <=== buffer overrun
          count++;
  }

This can be reproduced easily for me by this script:

  #!/bin/bash
  rm -f fs.img
  mkdir -p /mnt/ext4
  fallocate -l 16M fs.img
  mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
  debugfs -w -R "ssv first_meta_bg 842150400" fs.img
  mount -o loop fs.img /mnt/ext4

Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.

Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01 15:08:37 -05:00
Eric Sandeen
9060dd2c50 ext4: fix mmp use after free during unmount
In ext4_put_super, we call brelse on the buffer head containing
the ext4 superblock, but then try to use it when we stop the
mmp thread, because when the thread shuts down it does:

write_mmp_block
  ext4_mmp_csum_set
    ext4_has_metadata_csum
      WARN_ON_ONCE(ext4_has_feature_metadata_csum(sb)...)

which reaches into sb->s_fs_info->s_es->s_feature_ro_compat,
which lives in the superblock buffer s_sbh which we just released.

Fix this by moving the brelse down to a point where we are no
longer using it.

Reported-by: Wang Shu <shuwang@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-11-26 14:24:51 -05:00
Eric Biggers
2f8f5e76c7 ext4: avoid lockdep warning when inheriting encryption context
On a lockdep-enabled kernel, xfstests generic/027 fails due to a lockdep
warning when run on ext4 mounted with -o test_dummy_encryption:

    xfs_io/4594 is trying to acquire lock:
     (jbd2_handle
    ){++++.+}, at:
    [<ffffffff813096ef>] jbd2_log_wait_commit+0x5/0x11b

    but task is already holding lock:
     (jbd2_handle
    ){++++.+}, at:
    [<ffffffff813000de>] start_this_handle+0x354/0x3d8

The abbreviated call stack is:

 [<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
 [<ffffffff8130972a>] jbd2_log_wait_commit+0x40/0x11b
 [<ffffffff813096ef>] ? jbd2_log_wait_commit+0x5/0x11b
 [<ffffffff8130987b>] ? __jbd2_journal_force_commit+0x76/0xa6
 [<ffffffff81309896>] __jbd2_journal_force_commit+0x91/0xa6
 [<ffffffff813098b9>] jbd2_journal_force_commit_nested+0xe/0x18
 [<ffffffff812a6049>] ext4_should_retry_alloc+0x72/0x79
 [<ffffffff812f0c1f>] ext4_xattr_set+0xef/0x11f
 [<ffffffff812cc35b>] ext4_set_context+0x3a/0x16b
 [<ffffffff81258123>] fscrypt_inherit_context+0xe3/0x103
 [<ffffffff812ab611>] __ext4_new_inode+0x12dc/0x153a
 [<ffffffff812bd371>] ext4_create+0xb7/0x161

When a file is created in an encrypted directory, ext4_set_context() is
called to set an encryption context on the new file.  This calls
ext4_xattr_set(), which contains a retry loop where the journal is
forced to commit if an ENOSPC error is encountered.

If the task actually were to wait for the journal to commit in this
case, then it would deadlock because a handle remains open from
__ext4_new_inode(), so the running transaction can't be committed yet.
Fortunately, __jbd2_journal_force_commit() avoids the deadlock by not
allowing the running transaction to be committed while the current task
has it open.  However, the above lockdep warning is still triggered.

This was a false positive which was introduced by: 1eaa566d36: jbd2:
track more dependencies on transaction commit

Fix the problem by passing the handle through the 'fs_data' argument to
ext4_set_context(), then using ext4_xattr_set_handle() instead of
ext4_xattr_set().  And in the case where no journal handle is specified
and ext4_set_context() has to open one, add an ENOSPC retry loop since
in that case it is the outermost transaction.

Signed-off-by: Eric Biggers <ebiggers@google.com>
2016-11-21 11:52:44 -05:00
Jan Kara
a3caa24b70 ext4: only set S_DAX if DAX is really supported
Currently we have S_DAX set inode->i_flags for a regular file whenever
ext4 is mounted with dax mount option. However in some cases we cannot
really do DAX - e.g. when inode is marked to use data journalling, when
inode data is being encrypted, or when inode is stored inline. Make sure
S_DAX flag is appropriately set/cleared in these cases.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:32:59 -05:00
Theodore Ts'o
8cdf3372fe ext4: sanity check the block and cluster size at mount time
If the block size or cluster size is insane, reject the mount.  This
is important for security reasons (although we shouldn't be just
depending on this check).

Ref: http://www.securityfocus.com/archive/1/539661
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1332506
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-19 20:58:15 -05:00
Theodore Ts'o
c48ae41baf ext4: add sanity checking to count_overhead()
The commit "ext4: sanity check the block and cluster size at mount
time" should prevent any problems, but in case the superblock is
modified while the file system is mounted, add an extra safety check
to make sure we won't overrun the allocated buffer.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:37:47 -05:00
Theodore Ts'o
cd6bb35bf7 ext4: use more strict checks for inodes_per_block on mount
Centralize the checks for inodes_per_block and be more strict to make
sure the inodes_per_block_group can't end up being zero.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2016-11-18 13:28:30 -05:00
Theodore Ts'o
5aee0f8a3f ext4: fix in-superblock mount options processing
Fix a large number of problems with how we handle mount options in the
superblock.  For one, if the string in the superblock is long enough
that it is not null terminated, we could run off the end of the string
and try to interpret superblocks fields as characters.  It's unlikely
this will cause a security problem, but it could result in an invalid
parse.  Also, parse_options is destructive to the string, so in some
cases if there is a comma-separated string, it would be modified in
the superblock.  (Fortunately it only happens on file systems with a
1k block size.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:24:26 -05:00
Theodore Ts'o
9e47a4c9fc ext4: sanity check the block and cluster size at mount time
If the block size or cluster size is insane, reject the mount.  This
is important for security reasons (although we shouldn't be just
depending on this check).

Ref: http://www.securityfocus.com/archive/1/539661
Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1332506
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-11-18 13:00:24 -05:00
Deepa Dinamani
eeca7ea1ba ext4: use current_time() for inode timestamps
CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
current_time() will be transitioned to be y2038 safe
along with vfs.

current_time() returns timestamps according to the
granularities set in the super_block.
The granularity check in ext4_current_time() to call
current_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to obtain timestamps
unconditionally, and remove ext4_current_time().

Quota files are assumed to be on the same filesystem.
Hence, use current_time() for these files as well.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
2016-11-14 21:40:10 -05:00
Theodore Ts'o
1566a48aaa ext4: don't lock buffer in ext4_commit_super if holding spinlock
If there is an error reported in mballoc via ext4_grp_locked_error(),
the code is holding a spinlock, so ext4_commit_super() must not try to
lock the buffer head, or else it will trigger a BUG:

  BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:358
  in_atomic(): 1, irqs_disabled(): 0, pid: 993, name: mount
  CPU: 0 PID: 993 Comm: mount Not tainted 4.9.0-rc1-clouder1 #62
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
   ffff880006423548 ffffffff81318c89 ffffffff819ecdd0 0000000000000166
   ffff880006423558 ffffffff810810b0 ffff880006423580 ffffffff81081153
   ffff880006e5a1a0 ffff88000690e400 0000000000000000 ffff8800064235c0
  Call Trace:
    [<ffffffff81318c89>] dump_stack+0x67/0x9e
    [<ffffffff810810b0>] ___might_sleep+0xf0/0x140
    [<ffffffff81081153>] __might_sleep+0x53/0xb0
    [<ffffffff8126c1dc>] ext4_commit_super+0x19c/0x290
    [<ffffffff8126e61a>] __ext4_grp_locked_error+0x14a/0x230
    [<ffffffff81081153>] ? __might_sleep+0x53/0xb0
    [<ffffffff812822be>] ext4_mb_generate_buddy+0x1de/0x320

Since ext4_grp_locked_error() calls ext4_commit_super with sync == 0
(and it is the only caller which does so), avoid locking and unlocking
the buffer in this case.

This can result in races with ext4_commit_super() if there are other
problems (which is what commit 4743f83990 was trying to address),
but a Warning is better than BUG.

Fixes: 4743f83990
Cc: stable@vger.kernel.org # 4.9
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:29 -05:00
Theodore Ts'o
2c98eb5ea2 ext4: allow ext4_truncate() to return an error
This allows us to properly propagate errors back up to
ext4_truncate()'s callers.  This also means we no longer have to
silently ignore some errors (e.g., when trying to add the inode to the
orphan inode list).

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-11-13 22:02:26 -05:00
Christoph Hellwig
70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Joe Perches
651e1c3b15 ext4: super.c: Update logging style using KERN_CONT
Recent commit require line continuing printks to use PR_CONT.

Update super.c to use KERN_CONT and use vsprintf extension %pV to
avoid a printk/vprintk/printk("\n") sequence as well.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-10-12 23:12:53 -04:00
Eric Whitney
3c816ded78 ext4: use journal inode to determine journal overhead
When a file system contains an internal journal that has not been
loaded, use the journal inode's i_size field to determine its
contribution to the file system's overhead.  (The journal's j_maxlen
field is normally used to determine its size, but it's unavailable when
the journal has not been loaded.)

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 02:08:49 -04:00
Eric Whitney
c6cb7e776a ext4: create function to read journal inode
Factor out the code used in ext4_get_journal() to read a valid journal
inode from storage, enabling its reuse in other functions.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 02:05:09 -04:00
Dmitry Monakhov
e22834f024 ext4: improve ext4lazyinit scalability
ext4lazyinit is a global thread. This thread performs itable
initalization under li_list_mtx mutex.

It basically does the following:
ext4_lazyinit_thread
  ->mutex_lock(&eli->li_list_mtx);
  ->ext4_run_li_request(elr)
    ->ext4_init_inode_table-> Do a lot of IO if the list is large

And when new mount/umount arrive they have to block on ->li_list_mtx
because  lazy_thread holds it during full walk procedure.
ext4_fill_super
 ->ext4_register_li_request
   ->mutex_lock(&ext4_li_info->li_list_mtx);
   ->list_add(&elr->lr_request, &ext4_li_info >li_request_list);
In my case mount takes 40minutes on server with 36 * 4Tb HDD.
Common user may face this in case of very slow dev ( /dev/mmcblkXXX)
Even more. If one of filesystems was frozen lazyinit_thread will simply
block on sb_start_write() so other mount/umount will be stuck forever.

This patch changes logic like follows:
- grab ->s_umount read sem before processing new li_request.
  After that it is safe to drop li_list_mtx because all callers of
  li_remove_request are holding ->s_umount for write.
- li_thread skips frozen SB's

Locking order:
Mh KOrder is asserted by umount path like follows: s_umount ->li_list_mtx so
the only way to to grab ->s_mount inside li_thread is via down_read_trylock

xfstests:ext4/023
#PSBM-49658

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:38:36 -04:00
Jan Kara
49da939272 ext4: enable quota enforcement based on mount options
When quota information is stored in quota files, we enable only quota
accounting on mount and enforcement is enabled only in response to
Q_QUOTAON quotactl. To make ext4 behavior consistent with XFS, we add a
possibility to enable quota enforcement on mount by specifying
corresponding quota mount option (usrquota, grpquota, prjquota).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-05 23:08:16 -04:00
Linus Torvalds
b8927721ae Fix bugs that could cause kernel deadlocks or file system corruption
while moving xattrs to expand the extended inode.  Also add some
 sanity checks to the block group descriptors to make sure we don't end
 up overwriting the superblock.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXw7i2AAoJEPL5WVaVDYGj96gH/A8rNgx7BoqPx3kanVEamblT
 tM0X9JcEGmKHN4enRts2b78EWbR0/U0SOP92+fg9SSq2MDJ0/kdaKLWmbUwx8jUi
 B7HMEqCprlCdigK7wwt3xF+6edyZRhtzlWy3bhxJ40f0KT5CuriSQbxogr931uKl
 hUKW2h5JtUqHtINzTt4oWjVm8xwrScxuYHYAcpw0G42ZzfO6xQOzQdowcx4m3cE9
 PrtTbU5MwW8/wgsdLiClScQq30MK/GCbHh5heyRt1BcNo9+MDsZDOgdavh9StfnW
 Bl1N6zwRtRBJNcpKWfTfwU4NTIvStCTyA8BJgKgE95YIHDsstJVl4MO7ot25qbM=
 =pXe+
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix bugs that could cause kernel deadlocks or file system corruption
  while moving xattrs to expand the extended inode.

  Also add some sanity checks to the block group descriptors to make
  sure we don't end up overwriting the superblock"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: avoid deadlock when expanding inode size
  ext4: properly align shifted xattrs when expanding inodes
  ext4: fix xattr shifting when expanding inodes part 2
  ext4: fix xattr shifting when expanding inodes
  ext4: validate that metadata blocks do not overlap superblock
  ext4: reserve xattr index for the Hurd
2016-08-29 12:37:11 -07:00
Theodore Ts'o
829fa70ddd ext4: validate that metadata blocks do not overlap superblock
A number of fuzzing failures seem to be caused by allocation bitmaps
or other metadata blocks being pointed at the superblock.

This can cause kernel BUG or WARNings once the superblock is
overwritten, so validate the group descriptor blocks to make sure this
doesn't happen.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-01 00:51:02 -04:00
Linus Torvalds
396d10993f The major change this cycle is deleting ext4's copy of the file system
encryption code and switching things over to using the copies in
 fs/crypto.  I've updated the MAINTAINERS file to add an entry for
 fs/crypto listing Jaeguk Kim and myself as the maintainers.
 
 There are also a number of bug fixes, most notably for some problems
 found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum.  Also
 fixed is a writeback deadlock detected by generic/130, and some
 potential races in the metadata checksum code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXlbP9AAoJEPL5WVaVDYGjGxgIAJ9YIqme//yix63oHYLhDNea
 lY/TLqZrb9/TdDRvGyZa3jYaKaIejL53eEQS9nhEB/JI0sEiDpHmOrDOxdj8Hlsw
 fm7nJyh1u4vFKPyklCbIvLAje1vl8X/6OvqQiwh45gIxbbsFftaBWtccW+UtEkIP
 Fx65Vk7RehJ/sNrM0cRrwB79YAmDS8P6BPyzdMRk+vO/uFqyq7Auc+pkd+bTlw/m
 TDAEIunlk0Ovjx75ru1zaemL1JJx5ffehrJmGCcSUPHVbMObOEKIrlV50gAAKVhO
 qbZAri3mhDvyspSLuS/73L9skeCiWFLhvojCBGu4t2aa3JJolmItO7IpKi4HdRU=
 =bxGK
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The major change this cycle is deleting ext4's copy of the file system
  encryption code and switching things over to using the copies in
  fs/crypto.  I've updated the MAINTAINERS file to add an entry for
  fs/crypto listing Jaeguk Kim and myself as the maintainers.

  There are also a number of bug fixes, most notably for some problems
  found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum.  Also
  fixed is a writeback deadlock detected by generic/130, and some
  potential races in the metadata checksum code"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  ext4: verify extent header depth
  ext4: short-cut orphan cleanup on error
  ext4: fix reference counting bug on block allocation error
  MAINTAINRES: fs-crypto maintainers update
  ext4 crypto: migrate into vfs's crypto engine
  ext2: fix filesystem deadlock while reading corrupted xattr block
  ext4: fix project quota accounting without quota limits enabled
  ext4: validate s_reserved_gdt_blocks on mount
  ext4: remove unused page_idx
  ext4: don't call ext4_should_journal_data() on the journal inode
  ext4: Fix WARN_ON_ONCE in ext4_commit_super()
  ext4: fix deadlock during page writeback
  ext4: correct error value of function verifying dx checksum
  ext4: avoid modifying checksum fields directly during checksum verification
  ext4: check for extents that wrap around
  jbd2: make journal y2038 safe
  jbd2: track more dependencies on transaction commit
  jbd2: move lockdep tracking to journal_s
  jbd2: move lockdep instrumentation for jbd2 handles
  ext4: respect the nobarrier mount option in nojournal mode
  ...
2016-07-26 18:35:55 -07:00
Vegard Nossum
c65d5c6c81 ext4: short-cut orphan cleanup on error
If we encounter a filesystem error during orphan cleanup, we should stop.
Otherwise, we may end up in an infinite loop where the same inode is
processed again and again.

    EXT4-fs (loop0): warning: checktime reached, running e2fsck is recommended
    EXT4-fs error (device loop0): ext4_mb_generate_buddy:758: group 2, block bitmap and bg descriptor inconsistent: 6117 vs 0 free clusters
    Aborting journal on device loop0-8.
    EXT4-fs (loop0): Remounting filesystem read-only
    EXT4-fs error (device loop0) in ext4_free_blocks:4895: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs error (device loop0) in ext4_ext_remove_space:3068: IO failure
    EXT4-fs error (device loop0) in ext4_ext_truncate:4667: Journal has aborted
    EXT4-fs error (device loop0) in ext4_orphan_del:2927: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs (loop0): Inode 16 (00000000618192a0): orphan list check failed!
    [...]
    EXT4-fs (loop0): Inode 16 (0000000061819748): orphan list check failed!
    [...]
    EXT4-fs (loop0): Inode 16 (0000000061819bf0): orphan list check failed!
    [...]

See-also: c9eb13a910 ("ext4: fix hang when processing corrupted orphaned inode list")
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-07-14 23:21:35 -04:00
Jaegeuk Kim
a7550b30ab ext4 crypto: migrate into vfs's crypto engine
This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-10 14:01:03 -04:00
Theodore Ts'o
5b9554dc5b ext4: validate s_reserved_gdt_blocks on mount
If s_reserved_gdt_blocks is extremely large, it's possible for
ext4_init_block_bitmap(), which is called when ext4 sets up an
uninitialized block bitmap, to corrupt random kernel memory.  Add the
same checks which e2fsck has --- it must never be larger than
blocksize / sizeof(__u32) --- and then add a backup check in
ext4_init_block_bitmap() in case the superblock gets modified after
the file system is mounted.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-07-05 20:01:52 -04:00
Pranay Kr. Srivastava
4743f83990 ext4: Fix WARN_ON_ONCE in ext4_commit_super()
If there are racing calls to ext4_commit_super() it's possible for
another writeback of the superblock to result in the buffer being
marked with an error after we check if the buffer is marked as having
a write error and the buffer up-to-date flag is set again.  If that
happens mark_buffer_dirty() can end up throwing a WARN_ON_ONCE.

Fix this by moving this check to write before we call
write_buffer_dirty(), and keeping the buffer locked during this whole
sequence.

Signed-off-by: Pranay Kr. Srivastava <pranjas@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-04 10:24:52 -04:00
Daeho Jeong
b47820edd1 ext4: avoid modifying checksum fields directly during checksum verification
We temporally change checksum fields in buffers of some types of
metadata into '0' for verifying the checksum values. By doing this
without locking the buffer, some metadata's checksums, which are
being committed or written back to the storage, could be damaged.
In our test, several metadata blocks were found with damaged metadata
checksum value during recovery process. When we only verify the
checksum value, we have to avoid modifying checksum fields directly.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-07-03 17:51:39 -04:00
Mike Christie
dfec8a14fc fs: have ll_rw_block users pass in op and flags separately
This has ll_rw_block users pass in the operation and flags separately,
so ll_rw_block can setup the bio op and bi_rw flags on the bio that
is submitted.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Toshi Kani
87eefeb4e8 ext4: Add alignment check for DAX mount
When a partition is not aligned by 4KB, mount -o dax succeeds,
but any read/write access to the filesystem fails, except for
metadata update.

Call bdev_dax_supported() to perform proper precondition checks
which includes this partition alignment check.

Reported-by: Micah Parrish <micah.parrish@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17 00:44:11 -06:00
Daeho Jeong
c8585c6fca ext4: fix races between changing inode journal mode and ext4_writepages
In ext4, there is a race condition between changing inode journal mode
and ext4_writepages(). While ext4_writepages() is executed on a
non-journalled mode inode, the inode's journal mode could be enabled
by ioctl() and then, some pages dirtied after switching the journal
mode will be still exposed to ext4_writepages() in non-journaled mode.
To resolve this problem, we use fs-wide per-cpu rw semaphore by Jan
Kara's suggestion because we don't want to waste ext4_inode_info's
space for this extra rare case.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-04-25 23:22:35 -04:00
Linus Torvalds
93061f390f These changes contains a fix for overlayfs interacting with some
(badly behaved) dentry code in various file systems.  These have been
 reviewed by Al and the respective file system mtinainers and are going
 through the ext4 tree for convenience.
 
 This also has a few ext4 encryption bug fixes that were discovered in
 Android testing (yes, we will need to get these sync'ed up with the
 fs/crypto code; I'll take care of that).  It also has some bug fixes
 and a change to ignore the legacy quota options to allow for xfstests
 regression testing of ext4's internal quota feature and to be more
 consistent with how xfs handles this case.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXBn4aAAoJEPL5WVaVDYGjHWgH/2wXnlQnC2ndJhblBWtPzprz
 OQW4dawdnhxqbTEGUqWe942tZivSb/liu/lF+urCGbWsbgz9jNOCmEAg7JPwlccY
 mjzwDvtVq5U4d2rP+JDWXLy/Gi8XgUclhbQDWFVIIIea6fS7IuFWqoVBR+HPMhra
 9tEygpiy5lNtJA/hqq3/z9x0AywAjwrYR491CuWreo2Uu1aeKg0YZsiDsuAcGioN
 Waa2TgbC/ZZyJuJcPBP8If+VOFAa0ea3F+C/o7Tb9bOqwuz0qSTcaMRgt6eQ2KUt
 P4b9Ecp1XLjJTC7IYOknUOScY3lCyREx/Xya9oGZfFNTSHzbOlLBoplCr3aUpYQ=
 =/HHR
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "These changes contains a fix for overlayfs interacting with some
  (badly behaved) dentry code in various file systems.  These have been
  reviewed by Al and the respective file system mtinainers and are going
  through the ext4 tree for convenience.

  This also has a few ext4 encryption bug fixes that were discovered in
  Android testing (yes, we will need to get these sync'ed up with the
  fs/crypto code; I'll take care of that).  It also has some bug fixes
  and a change to ignore the legacy quota options to allow for xfstests
  regression testing of ext4's internal quota feature and to be more
  consistent with how xfs handles this case"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: ignore quota mount options if the quota feature is enabled
  ext4 crypto: fix some error handling
  ext4: avoid calling dquot_get_next_id() if quota is not enabled
  ext4: retry block allocation for failed DIO and DAX writes
  ext4: add lockdep annotations for i_data_sem
  ext4: allow readdir()'s of large empty directories to be interrupted
  btrfs: fix crash/invalid memory access on fsync when using overlayfs
  ext4 crypto: use dget_parent() in ext4_d_revalidate()
  ext4: use file_dentry()
  ext4: use dget_parent() in ext4_file_open()
  nfs: use file_dentry()
  fs: add file_dentry()
  ext4 crypto: don't let data integrity writebacks fail with ENOMEM
  ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
2016-04-07 17:22:20 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Theodore Ts'o
c325a67c72 ext4: ignore quota mount options if the quota feature is enabled
Previously, ext4 would fail the mount if the file system had the quota
feature enabled and quota mount options (used for the older quota
setups) were present.  This broke xfstests, since xfs silently ignores
the usrquote and grpquota mount options if they are specified.  This
commit changes things so that we are consistent with xfs; having the
mount options specified is harmless, so no sense break users by
forbidding them.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-03 17:03:37 -04:00
Theodore Ts'o
8f0e8746b4 ext4: avoid calling dquot_get_next_id() if quota is not enabled
This should be fixed in the quota layer so we can test with the quota
mutex held, but for now, we need this to avoid tests from crashing the
kernel aborting the regression test suite.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-01 12:00:03 -04:00
Theodore Ts'o
daf647d2dd ext4: add lockdep annotations for i_data_sem
With the internal Quota feature, mke2fs creates empty quota inodes and
quota usage tracking is enabled as soon as the file system is mounted.
Since quotacheck is no longer preallocating all of the blocks in the
quota inode that are likely needed to be written to, we are now seeing
a lockdep false positive caused by needing to allocate a quota block
from inside ext4_map_blocks(), while holding i_data_sem for a data
inode.  This results in this complaint:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&ei->i_data_sem);
                                lock(&s->s_dquot.dqio_mutex);
                                lock(&ei->i_data_sem);
   lock(&s->s_dquot.dqio_mutex);

Google-Bug-Id: 27907753

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-04-01 01:31:28 -04:00
Linus Torvalds
77d913178c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF and quota updates from Jan Kara:
 "This contains a rewrite of UDF handling of filename encoding to fix
  remaining overflow issues from Andrew Gabbasov and quota changes to
  support new Q_[X]GETNEXTQUOTA quotactl for VFS quota formats"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Fix possible GPF due to uninitialised pointers
  ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodes
  quota: Forbid Q_GETQUOTA and Q_GETNEXTQUOTA for frozen filesystem
  quota: Fix possible races during quota loading
  ocfs2: Implement get_next_id()
  quota_v2: Implement get_next_id() for V2 quota format
  quota: Add support for ->get_nextdqblk() for VFS quota
  udf: Merge linux specific translation into CS0 conversion function
  udf: Remove struct ustr as non-needed intermediate storage
  udf: Use separate buffer for copying split names
  udf: Adjust UDF_NAME_LEN to better reflect actual restrictions
  udf: Join functions for UTF8 and NLS conversions
  udf: Parameterize output length in udf_put_filename
  quota: Allow Q_GETQUOTA for frozen filesystem
  quota: Fixup comments about return value of Q_[X]GETNEXTQUOTA
2016-03-21 12:22:37 -07:00
Ales Novak
7915a861c0 ext4: print ext4 mount option data_err=abort correctly
If data_err=abort option is specified for an ext3/ext4 mount,
/proc/mounts does show it as "(null)". This is caused by token2str()
returning NULL for Opt_data_err_abort (due to its pattern containing
'=').

Signed-off-by: Ales Novak <alnovak@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-12 21:55:50 -05:00
Jan Kara
600be30a8b ext4: remove i_ioend_count
Remove counter of pending io ends as it is unused.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:39:21 -05:00
Jan Kara
e142d05263 ext4: use i_mutex to serialize unaligned AIO DIO
Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
However the code cleanups that happened after 2011 when the lock was
introduced made aio_mutex acquired at almost the same places where we
already have exclusion using i_mutex. So just use i_mutex for the
exclusion of unaligned AIO DIO.

The change moves waiting for pending unwritten extent conversion under
i_mutex. That makes special handling of O_APPEND writes unnecessary and
also avoids possible livelocking of unaligned AIO DIO with aligned one
(nothing was preventing contiguous stream of aligned AIO DIOs to let
unaligned AIO DIO wait forever).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 22:44:50 -05:00
Andreas Gruenbacher
2335d05f3a ext4: kill ext4_mballoc_ready
This variable, introduced in commit 9c191f70, is unnecessary: it is set
once the module has been initialized correctly, and ext4_fill_super
cannot run unless the module has been initialized correctly.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:41:05 -05:00
Jan Kara
82939d7999 ext4: convert to mbcache2
The conversion is generally straightforward. The only tricky part is
that xattr block corresponding to found mbcache entry can get freed
before we get buffer lock for that block. So we have to check whether
the entry is still valid after getting buffer lock.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 11:50:13 -05:00
Eric Sandeen
6332b9b5e7 ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodes
We forgot to set .get_nextdqblk operation in quotactl_ops structure used
by ext4 when quota is using hidden inode thus the operation was not
really supported. Fix the omission.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-19 19:28:07 +01:00
Jan Kara
be6257b251 quota: Add support for ->get_nextdqblk() for VFS quota
Add infrastructure for supporting get_nextdqblk() callback for VFS
quotas. Translate the operation into a callback to appropriate
filesystem and consequently to quota format callback.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-09 13:05:23 +01:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Linus Torvalds
391f2a16b7 Some locking and page fault bug fixes from Jan Kara, some ext4
encryption fixes from me, and Li Xi's Project Quota commits.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJWnwkiAAoJEPL5WVaVDYGjyAAH/1dj1nNL9h+d12V3zXbvoPkg
 5RFw/2QfMZ+GE3Lln9gxTBDSyo/9m8hUK8eg0WpIRtGX9NbKcyrWEGJa2XF++43k
 tVpKGyN6cqkwPu4M6EPIK9yRvuALGB5PJE/u0q1lu9VoIAgtin3F/bAQK/iHnrUg
 M3+lVDtKcmbhqCdocaLLZD6Q4xlQI3wJne99pYt+Dtx95aOQY9v9SV030i7sOnEt
 R5JrhmfkgNqVTB8Zz0IxOp5LQlOkuyvtnZ44yYgJH8ckCUnDQI2hbksSqcMamJ1Y
 QJWBzRhVXU9gs1nCRy/Xh48mSk+nvZW9aglk+Syzbzg5C63SgwYcqvbCBqJJEdc=
 =HjkT
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Some locking and page fault bug fixes from Jan Kara, some ext4
  encryption fixes from me, and Li Xi's Project Quota commits"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  fs: clean up the flags definition in uapi/linux/fs.h
  ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
  ext4: add project quota support
  ext4: adds project ID support
  ext4 crypto: simplify interfaces to directory entry insert functions
  ext4 crypto: add missing locking for keyring_key access
  ext4: use pre-zeroed blocks for DAX page faults
  ext4: implement allocation of pre-zeroed blocks
  ext4: provide ext4_issue_zeroout()
  ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flag
  ext4: document lock ordering
  ext4: fix races of writeback with punch hole and zero range
  ext4: fix races between buffered IO and collapse / insert range
  ext4: move unlocked dio protection from ext4_alloc_file_blocks()
  ext4: fix races between page faults and hole punching
2016-01-22 11:23:35 -08:00
Vladimir Davydov
5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Li Xi
689c958cbe ext4: add project quota support
This patch adds mount options for enabling/disabling project quota
accounting and enforcement. A new specific inode is also used for
project quota accounting.

[ Includes fix from Dan Carpenter to crrect error checking from dqget(). ]

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-01-08 16:01:22 -05:00
Li Xi
040cb3786d ext4: adds project ID support
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-01-08 16:01:21 -05:00
Jan Kara
e74031fd7e ext4: document lock ordering
We have enough locks that it's probably worth documenting the lock
ordering rules we have in ext4.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 14:35:49 -05:00
Jan Kara
ea3d7209ca ext4: fix races between page faults and hole punching
Currently, page faults and hole punching are completely unsynchronized.
This can result in page fault faulting in a page into a range that we
are punching after truncate_pagecache_range() has been called and thus
we can end up with a page mapped to disk blocks that will be shortly
freed. Filesystem corruption will shortly follow. Note that the same
race is avoided for truncate by checking page fault offset against
i_size but there isn't similar mechanism available for punching holes.

Fix the problem by creating new rw semaphore i_mmap_sem in inode and
grab it for writing over truncate, hole punching, and other functions
removing blocks from extent tree and for read over page faults. We
cannot easily use i_data_sem for this since that ranks below transaction
start and we need something ranking above it so that it can be held over
the whole truncate / hole punching operation. Also remove various
workarounds we had in the code to reduce race window when page fault
could have created pages with stale mapping information.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 14:28:03 -05:00
Dan Williams
ef83b6e8f4 ext2, ext4: warn when mounting with dax enabled
Similar to XFS warn when mounting DAX while it is still considered under
development.  Also, aspects of the DAX implementation, for example
synchronization against multiple faults and faults causing block
allocation, depend on the correct implementation in the filesystem.  The
maturity of a given DAX implementation is filesystem specific.

Cc: <stable@vger.kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: linux-ext4@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-11-16 09:43:54 -08:00
Linus Torvalds
ad804a0b2a Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:

 - most of the rest of MM

 - procfs

 - lib/ updates

 - printk updates

 - bitops infrastructure tweaks

 - checkpatch updates

 - nilfs2 update

 - signals

 - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
   dma-debug, dma-mapping, ...

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  ipc,msg: drop dst nil validation in copy_msg
  include/linux/zutil.h: fix usage example of zlib_adler32()
  panic: release stale console lock to always get the logbuf printed out
  dma-debug: check nents in dma_sync_sg*
  dma-mapping: tidy up dma_parms default handling
  pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
  kexec: use file name as the output message prefix
  fs, seqfile: always allow oom killer
  seq_file: reuse string_escape_str()
  fs/seq_file: use seq_* helpers in seq_hex_dump()
  coredump: change zap_threads() and zap_process() to use for_each_thread()
  coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
  signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
  signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
  signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
  signals: kill block_all_signals() and unblock_all_signals()
  nilfs2: fix gcc uninitialized-variable warnings in powerpc build
  nilfs2: fix gcc unused-but-set-variable warnings
  MAINTAINERS: nilfs2: add header file for tracing
  nilfs2: add tracepoints for analyzing reading and writing metadata files
  ...
2015-11-07 14:32:45 -08:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Dmitry Monakhov
1e381f60da ext4: do not allow journal_opts for fs w/o journal
It is appeared that we can pass journal related mount options and such options
be shown in /proc/mounts

Example:
#mkfs.ext4 -F /dev/vdb
#tune2fs -O ^has_journal /dev/vdb
#mount /dev/vdb /mnt/  -ocommit=20,journal_async_commit
#cat /proc/mounts  | grep /mnt
 /dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0

But options:"journal_checksum,journal_async_commit,commit=20,data=ordered" has
nothing with reality because there is no journal at all.

This patch disallow following options for journalless configurations:
 - journal_checksum
 - journal_async_commit
 - commit=%ld
 - data={writeback,ordered,journal}

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2015-10-18 23:50:26 -04:00
Dmitry Monakhov
c93cf2d757 ext4: explicit mount options parsing cleanup
Currently MOPT_EXPLICIT treated as EXPLICIT_DELALLOC which may be changed
in future. Let's fix it now.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-18 23:35:32 -04:00
Daeho Jeong
4327ba52af ext4, jbd2: ensure entering into panic after recording an error in superblock
If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
journaling will be aborted first and the error number will be recorded
into JBD2 superblock and, finally, the system will enter into the
panic state in "errors=panic" option.  But, in the rare case, this
sequence is little twisted like the below figure and it will happen
that the system enters into panic state, which means the system reset
in mobile environment, before completion of recording an error in the
journal superblock. In this case, e2fsck cannot recognize that the
filesystem failure occurred in the previous run and the corruption
wouldn't be fixed.

Task A                        Task B
ext4_handle_error()
-> jbd2_journal_abort()
  -> __journal_abort_soft()
    -> __jbd2_journal_abort_hard()
    | -> journal->j_flags |= JBD2_ABORT;
    |
    |                         __ext4_abort()
    |                         -> jbd2_journal_abort()
    |                         | -> __journal_abort_soft()
    |                         |   -> if (journal->j_flags & JBD2_ABORT)
    |                         |           return;
    |                         -> panic()
    |
    -> jbd2_journal_update_sb_errno()

Tested-by: Hobin Woo <hobin.woo@samsung.com>
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-10-18 17:02:56 -04:00
Darrick J. Wong
e2b911c535 ext4: clean up feature test macros with predicate functions
Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros.  Furthermore, clean out the
places where we open-coded feature tests.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2015-10-17 16:18:43 -04:00
Darrick J. Wong
6a797d2737 ext4: call out CRC and corruption errors with specific error codes
Instead of overloading EIO for CRC errors and corrupt structures,
return the same error codes that XFS returns for the same issues.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17 16:16:04 -04:00
Darrick J. Wong
8c81bd8f58 ext4: store checksum seed in superblock
Allow the filesystem to store the metadata checksum seed in the
superblock and add an incompat feature to say that we're using it.
This enables tune2fs to change the UUID on a mounted metadata_csum
FS without having to (racy!) rewrite all disk metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17 16:16:02 -04:00
Theodore Ts'o
ebd173beb8 ext4: move procfs registration code to fs/ext4/sysfs.c
This allows us to refactor the procfs code, which saves a bit of
compiled space.  More importantly it isolates most of the procfs
support code into a single file, so it's easier to #ifdef it out if
the proc file system has been disabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-09-23 12:46:17 -04:00
Theodore Ts'o
b579901882 ext4: move sysfs code from super.c to fs/ext4/sysfs.c
Also statically allocate the ext4_kset and ext4_feat objects, since we
only need exactly one of each, and it's simpler and less code if we
drop the dynamic allocation and deallocation when it's not needed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-09-23 12:44:17 -04:00
Kees Cook
a068acf2ee fs: create and use seq_show_option for escaping
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g.  new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else.  This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed.  Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Linus Torvalds
ea814ab9aa Pretty much all bug fixes and clean ups for 4.3, after a lot of
features and other churn going into 4.2.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJV55TlAAoJEPL5WVaVDYGjyzYH/1WtZpIzRjp7o+3H4/vFqONg
 R1Fsw785C1w8WX2QuIK/m31u4XO+VeCV4jWA9DuqnSzWm9w9C/4kTqITd4Hwp416
 /9gJvYoZHHaDikxpWWADptDi8IoLohTlcFVCHIvvf53cGehVEEsc2WOijUZo7Cgv
 O454Nm3tK0CQ3yrCIlf5SyvkUZSMTiawLLJJzd4GCyvU13C1SnABNQj8UxKisBA5
 cP8q4O2nPg/S9rkYxnFAifQyZppd3jMvorUaq9eHiWMjl95o6e/6+wYGnHhoFUvr
 /P1dNjJYbzk+TUzlsDkq2zANK2UsB3iNNi8YwLFOpfFcuYopmUAYRIWOgIZWYUQ=
 =ijuI
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Pretty much all bug fixes and clean ups for 4.3, after a lot of
  features and other churn going into 4.2"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  Revert "ext4: remove block_device_ejected"
  ext4: ratelimit the file system mounted message
  ext4: silence a format string false positive
  ext4: simplify some code in read_mmp_block()
  ext4: don't manipulate recovery flag when freezing no-journal fs
  jbd2: limit number of reserved credits
  ext4 crypto: remove duplicate header file
  ext4: update c/mtime on truncate up
  jbd2: avoid infinite loop when destroying aborted journal
  ext4, jbd2: add REQ_FUA flag when recording an error in the superblock
  ext4 crypto: fix spelling typo in comment
  ext4 crypto: exit cleanly if ext4_derive_key_aes() fails
  ext4: reject journal options for ext2 mounts
  ext4: implement cgroup writeback support
  ext4: replace ext4_io_submit->io_op with ->io_wbc
  ext4 crypto: check for too-short encrypted file names
  ext4 crypto: use a jbd2 transaction when adding a crypto policy
  jbd2: speedup jbd2_journal_dirty_metadata()
2015-09-03 12:52:19 -07:00
Theodore Ts'o
bdfe0cbd74 Revert "ext4: remove block_device_ejected"
This reverts commit 08439fec26.

Unfortunately we still need to test for bdi->dev to avoid a crash when a
USB stick is yanked out while a file system is mounted:

   usb 2-2: USB disconnect, device number 2
   Buffer I/O error on dev sdb1, logical block 15237120, lost sync page write
   JBD2: Error -5 detected when updating journal superblock for sdb1-8.
   BUG: unable to handle kernel paging request at 34beb000
   IP: [<c136ce88>] __percpu_counter_add+0x18/0xc0
   *pdpt = 0000000023db9001 *pde = 0000000000000000 
   Oops: 0000 [#1] SMP 
   CPU: 0 PID: 4083 Comm: umount Tainted: G     U     OE   4.1.1-040101-generic #201507011435
   Hardware name: LENOVO 7675CTO/7675CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
   task: ebf06b50 ti: ebebc000 task.ti: ebebc000
   EIP: 0060:[<c136ce88>] EFLAGS: 00010082 CPU: 0
   EIP is at __percpu_counter_add+0x18/0xc0
   EAX: f21c8e88 EBX: f21c8e88 ECX: 00000000 EDX: 00000001
   ESI: 00000001 EDI: 00000000 EBP: ebebde60 ESP: ebebde40
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
   CR0: 8005003b CR2: 34beb000 CR3: 33354200 CR4: 000007f0
   Stack:
    c1abe100 edcb0098 edcb00ec ffffffff f21c8e68 ffffffff f21c8e68 f286d160
    ebebde84 c1160454 00000010 00000282 f72a77f8 00000984 f72a77f8 f286d160
    f286d170 ebebdea0 c11e613f 00000000 00000282 f72a77f8 edd7f4d0 00000000
   Call Trace:
    [<c1160454>] account_page_dirtied+0x74/0x110
    [<c11e613f>] __set_page_dirty+0x3f/0xb0
    [<c11e6203>] mark_buffer_dirty+0x53/0xc0
    [<c124a0cb>] ext4_commit_super+0x17b/0x250
    [<c124ac71>] ext4_put_super+0xc1/0x320
    [<c11f04ba>] ? fsnotify_unmount_inodes+0x1aa/0x1c0
    [<c11cfeda>] ? evict_inodes+0xca/0xe0
    [<c11b925a>] generic_shutdown_super+0x6a/0xe0
    [<c10a1df0>] ? prepare_to_wait_event+0xd0/0xd0
    [<c1165a50>] ? unregister_shrinker+0x40/0x50
    [<c11b92f6>] kill_block_super+0x26/0x70
    [<c11b94f5>] deactivate_locked_super+0x45/0x80
    [<c11ba007>] deactivate_super+0x47/0x60
    [<c11d2b39>] cleanup_mnt+0x39/0x80
    [<c11d2bc0>] __cleanup_mnt+0x10/0x20
    [<c1080b51>] task_work_run+0x91/0xd0
    [<c1011e3c>] do_notify_resume+0x7c/0x90
    [<c1720da5>] work_notify
   Code: 8b 55 e8 e9 f4 fe ff ff 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 83 ec 20 89 5d f4 89 c3 89 75 f8 89 d6 89 7d fc 89 cf 8b 48 14 <64> 8b 01 89 45 ec 89 c2 8b 45 08 c1 fa 1f 01 75 ec 89 55 f0 89
   EIP: [<c136ce88>] __percpu_counter_add+0x18/0xc0 SS:ESP 0068:ebebde40
   CR2: 0000000034beb000
   ---[ end trace dd564a7bea834ecd ]---

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101011

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-08-16 10:03:57 -04:00
Theodore Ts'o
e294a5371b ext4: ratelimit the file system mounted message
The xfstests ext4/305 will mount and unmount the same file system over
4,000 times, and each one of these will cause a system log message.
Ratelimit this message since if we are getting more than a few dozen
of these messages, they probably aren't going to be helpful.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15 14:59:44 -04:00
Eric Sandeen
c642dc9e1a ext4: don't manipulate recovery flag when freezing no-journal fs
At some point along this sequence of changes:

f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops
bb04457 ext4: support freezing ext2 (nojournal) file systems
9ca9238 ext4: Use separate super_operations structure for no_journal filesystems

ext4 started setting needs_recovery on filesystems without journals
when they are unfrozen.  This makes no sense, and in fact confuses
blkid to the point where it doesn't recognize the filesystem at all.

(freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
see needs_recovery set on fs w/ no journal).

To fix this, don't manipulate the INCOMPAT_RECOVER feature on
filesystems without journals.

Reported-by: Stu Mark <smark@datto.com>
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-08-15 10:45:06 -04:00
Jan Kara
c290ea01ab fs: Remove ext3 filesystem driver
The functionality of ext3 is fully supported by ext4 driver. Major
distributions (SUSE, RedHat) already use ext4 driver to handle ext3
filesystems for quite some time. There is some ugliness in mm resulting
from jbd cleaning buffers in a dirty page without cleaning page dirty
bit and also support for buffer bouncing in the block layer when stable
pages are required is there only because of jbd. So let's remove the
ext3 driver. This saves us some 28k lines of duplicated code.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-07-23 20:59:40 +02:00
Daeho Jeong
564bc40252 ext4, jbd2: add REQ_FUA flag when recording an error in the superblock
When an error condition is detected, an error status should be recorded into
superblocks of EXT4 or JBD2. However, the write request is submitted now
without REQ_FUA flag, even in "barrier=1" mode, which is followed by
panic() function in "errors=panic" mode. On mobile devices which make
whole system reset as soon as kernel panic occurs, this write request
containing an error flag will disappear just from storage cache without
written to the physical cells. Therefore, when next start, even forever,
the error flag cannot be shown in both superblocks, and e2fsck cannot fix
the filesystem problems automatically, unless e2fsck is executed in
force checking mode.

[ Changed use test_opt(sb, BARRIER) of checking the journal flags -- TYT ]

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-23 09:46:11 -04:00
Carlos Maiolino
5ba92bcf0d ext4: reject journal options for ext2 mounts
There is no reason to allow ext2 filesystems be mounted with journal
mount options. So, this patch adds them to the MOPT_NO_EXT2 mount
options list.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-21 23:57:59 -04:00
Tejun Heo
001e4a8775 ext4: implement cgroup writeback support
For ordered and writeback data modes, all data IOs go through
ext4_io_submit.  This patch adds cgroup writeback support by invoking
wbc_init_bio() from io_submit_init_bio() and wbc_account_io() in
io_submit_add_bh().  Journal data which is written by jbd2 worker is
left alone by this patch and will always be written out from the root
cgroup.

ext4_fill_super() is updated to set MS_CGROUPWB when data mode is
either ordered or writeback.  In journaled data mode, most IOs become
synchronous through the journal and enabling cgroup writeback support
doesn't make much sense or difference.  Journaled data mode is left
alone.

Lightly tested with sequential data write workload.  Behaves as
expected.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-21 23:51:26 -04:00
Linus Torvalds
1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00
Linus Torvalds
47a469421d Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - most of the rest of MM

 - lots of misc things

 - procfs updates

 - printk feature work

 - updates to get_maintainer, MAINTAINERS, checkpatch

 - lib/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
  exit,stats: /* obey this comment */
  coredump: add __printf attribute to cn_*printf functions
  coredump: use from_kuid/kgid when formatting corename
  fs/reiserfs: remove unneeded cast
  NILFS2: support NFSv2 export
  fs/befs/btree.c: remove unneeded initializations
  fs/minix: remove unneeded cast
  init/do_mounts.c: add create_dev() failure log
  kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
  fs/efs: femove unneeded cast
  checkpatch: emit "NOTE: <types>" message only once after multiple files
  checkpatch: emit an error when there's a diff in a changelog
  checkpatch: validate MODULE_LICENSE content
  checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
  checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
  checkpatch: fix processing of MEMSET issues
  checkpatch: suggest using ether_addr_equal*()
  checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
  checkpatch: remove local from codespell path
  checkpatch: add --showfile to allow input via pipe to show filenames
  ...
2015-06-26 09:52:05 -07:00
Rasmus Villemoes
ec3904dc65 fs/ext4/super.c: use strreplace() in ext4_fill_super()
This makes a very large function a little smaller.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 17:00:41 -07:00
Linus Torvalds
e4bc13adfd Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block
Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
2015-06-25 16:00:17 -07:00
Miklos Szeredi
9bf39ab2ad vfs: add file_path() helper
Turn
	d_path(&file->f_path, ...);
into
	file_path(file, ...);

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23 18:00:05 -04:00
Theodore Ts'o
a2fd66d069 ext4: set lazytime on remount if MS_LAZYTIME is set by mount
Newer versions of mount parse the lazytime feature and pass it to the
mount system call via the flags field in the mount system call,
removing the lazytime string from the mount options list.  So we need
to check for the presence of MS_LAZYTIME and set it in sb->s_flags in
order for this flag to be set on a remount.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-23 11:03:54 -04:00
Theodore Ts'o
c5e298ae53 ext4: prevent ext4_quota_write() from failing due to ENOSPC
In order to prevent quota block tracking to be inaccurate when
ext4_quota_write() fails with ENOSPC, we make two changes.  The quota
file can now use the reserved block (since the quota file is arguably
file system metadata), and ext4_quota_write() now uses
ext4_should_retry_alloc() to retry the block allocation after a commit
has completed and released some blocks for allocation.

This fixes failures of xfstests generic/270:

Quota error (device vdc): write_blk: dquota write failed
Quota error (device vdc): qtree_write_dquot: Error -28 occurred while creating quota

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-21 01:25:29 -04:00
Theodore Ts'o
89d96a6f8e ext4: call sync_blockdev() before invalidate_bdev() in put_super()
Normally all of the buffers will have been forced out to disk before
we call invalidate_bdev(), but there will be some cases, where a file
system operation was aborted due to an ext4_error(), where there may
still be some dirty buffers in the buffer cache for the device.  So
try to force them out to memory before calling invalidate_bdev().

This fixes a warning triggered by generic/081:

WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f()

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-20 22:50:33 -04:00
Andreas Dilger
b03a2f7eb2 ext4: improve warning directory handling messages
Several ext4_warning() messages in the directory handling code do not
report the inode number of the (potentially corrupt) directory where a
problem is seen, and others report this in an ad-hoc manner.  Add an
ext4_warning_inode() helper to print the inode number and command name
consistent with ext4_error_inode().

Consolidate the place in ext4.h that these macros are defined.

Clean up some other directory error and warning messages to print the
calling function name.

Minor code style fixes in nearby lines.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 14:50:26 -04:00
Theodore Ts'o
1cb767cd4a ext4 crypto: fail the mount if blocksize != pagesize
We currently don't correctly handle the case where blocksize !=
pagesize, so disallow the mount in those cases.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:44:33 -04:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Chao Yu
e298e73bd7 ext4 crypto: release crypto resource on module exit
Crypto resource should be released when ext4 module exits, otherwise
it will cause memory leak.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:37:35 -04:00
Theodore Ts'o
c936e1ec28 ext4 crypto: use per-inode tfm structure
As suggested by Herbert Xu, we shouldn't allocate a new tfm each time
we read or write a page.  Instead we can use a single tfm hanging off
the inode's crypt_info structure for all of our encryption needs for
that inode, since the tfm can be used by multiple crypto requests in
parallel.

Also use cmpxchg() to avoid races that could result in crypt_info
structure getting doubly allocated or doubly freed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:34:22 -04:00
Theodore Ts'o
f5aed2c2a8 ext4: clean up superblock encryption mode fields
The superblock fields s_file_encryption_mode and s_dir_encryption_mode
are vestigal, so remove them as a cleanup.  While we're at it, allow
file systems with both encryption and inline_data enabled at the same
time to work correctly.  We can't have encrypted inodes with inline
data, but there's no reason to prohibit unencrypted inodes from using
the inline data feature.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:18:47 -04:00
Theodore Ts'o
b7236e21d5 ext4 crypto: reorganize how we store keys in the inode
This is a pretty massive patch which does a number of different things:

1) The per-inode encryption information is now stored in an allocated
   data structure, ext4_crypt_info, instead of directly in the node.
   This reduces the size usage of an in-memory inode when it is not
   using encryption.

2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
   encryption structure instead.  This remove an unnecessary memory
   allocation and free for the fname_crypto_ctx as well as allowing us
   to reuse the ctfm in a directory for multiple lookups and file
   creations.

3) We also cache the inode's policy information in the ext4_crypt_info
   structure so we don't have to continually read it out of the
   extended attributes.

4) We now keep the keyring key in the inode's encryption structure
   instead of releasing it after we are done using it to derive the
   per-inode key.  This allows us to test to see if the key has been
   revoked; if it has, we prevent the use of the derived key and free
   it.

5) When an inode is released (or when the derived key is freed), we
   will use memset_explicit() to zero out the derived key, so it's not
   left hanging around in memory.  This implies that when a user logs
   out, it is important to first revoke the key, and then unlink it,
   and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
   release any decrypted pages and dcache entries from the system
   caches.

6) All this, and we also shrink the number of lines of code by around
   100.  :-)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:17:47 -04:00
Theodore Ts'o
e2881b1b51 ext4 crypto: separate kernel and userspace structure for the key
Use struct ext4_encryption_key only for the master key passed via the
kernel keyring.

For internal kernel space users, we now use struct ext4_crypt_info.
This will allow us to put information from the policy structure so we
can cache it and avoid needing to constantly looking up the extended
attribute.  We will do this in a spearate patch.  This patch is mostly
mechnical to make it easier for patch review.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:16:47 -04:00
Theodore Ts'o
1b46617b8d ext4: don't save the error information if the block device is read-only
Google-Bug-Id: 20939131
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-14 18:37:30 -04:00
Linus Torvalds
9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
Linus Torvalds
6162e4b0be A few bug fixes and add support for file-system level encryption in ext4.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJVMvVGAAoJEPL5WVaVDYGjjZgH/0Z4bdtQpuQKAd2EoSUhiOh4
 tReqE1IuTU+urrL9qNA4qUFhAKq0Iju0INrnoYNb1+YxZ2myvUrMY4y2GkapaKgZ
 SFYL8LTS7E79/LuR6q1SFmUYoXCjqpWeHb7rAZ9OluSNQhke8SWdywLnp/0q05Go
 6SDwYdT8trxGED/wYTGPy9zMHcYEYHqIIvfFZd3eYtRnaP42Zo5rUvISg3cP0ekG
 LiX2D9Bi9pyqxgMjTG0+0xiC3ohTfXOujyHbnLVQ7kdZmpzZKfQspoczEIUolYb4
 /Ic4qPQQdbtjooQ7uRYUOFXeVjt7HZuTb3aVmh90RWrEhsLsyBmNd9StLFVdlcg=
 =9f7Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A few bug fixes and add support for file-system level encryption in
  ext4"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
  ext4 crypto: enable encryption feature flag
  ext4 crypto: add symlink encryption
  ext4 crypto: enable filename encryption
  ext4 crypto: filename encryption modifications
  ext4 crypto: partial update to namei.c for fname crypto
  ext4 crypto: insert encrypted filenames into a leaf directory block
  ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames
  ext4 crypto: filename encryption facilities
  ext4 crypto: implement the ext4 decryption read path
  ext4 crypto: implement the ext4 encryption write path
  ext4 crypto: inherit encryption policies on inode and directory create
  ext4 crypto: enforce context consistency
  ext4 crypto: add encryption key management facilities
  ext4 crypto: add ext4 encryption facilities
  ext4 crypto: add encryption policy and password salt support
  ext4 crypto: add encryption xattr support
  ext4 crypto: export ext4_empty_dir()
  ext4 crypto: add ext4 encryption Kconfig
  ext4 crypto: reserve codepoints used by the ext4 encryption feature
  ext4 crypto: add ext4_mpage_readpages()
  ...
2015-04-19 14:26:31 -07:00
Theodore Ts'o
6ddb244784 ext4 crypto: enable encryption feature flag
Also add the test dummy encryption mode flag so we can more easily
test the encryption patches using xfstests.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-16 01:56:00 -04:00
David Howells
2b0143b5c9 VFS: normal filesystems (and lustre): d_inode() annotations
that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:57 -04:00
Michael Halcrow
b30ab0e034 ext4 crypto: add ext4 encryption facilities
On encrypt, we will re-assign the buffer_heads to point to a bounce
page rather than the control_page (which is the original page to write
that contains the plaintext). The block I/O occurs against the bounce
page.  On write completion, we re-assign the buffer_heads to the
original plaintext page.

On decrypt, we will attach a read completion callback to the bio
struct. This read completion will decrypt the read contents in-place
prior to setting the page up-to-date.

The current encryption mode, AES-256-XTS, lacks cryptographic
integrity. AES-256-GCM is in-plan, but we will need to devise a
mechanism for handling the integrity data.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:43:56 -04:00
Christoph Hellwig
08439fec26 ext4: remove block_device_ejected
bdi->dev now never goes away, so this function became useless.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02 23:56:32 -04:00
Sheng Yong
72b8e0f9fa ext4: remove unused header files
Remove unused header files and header files which are included in
ext4.h.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02 23:47:42 -04:00
Jan Kara
0a240339a8 quota: Make VFS quotas use new interface for getting quota info
Create new internal interface for getting information about quota which
contains everything needed for both VFS quotas and XFS quotas. Make VFS
use this and hook it up to Q_GETINFO.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-03-04 16:06:34 +01:00
Linus Torvalds
feaf222925 Ext4 bug fixes for 3.20. We also reserved code points for encryption
and read-only images (for which the implementation is mostly just the
 reserved code point for a read-only feature :-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJU6lssAAoJENNvdpvBGATwpsEQAOcpCqj0gp/istbsoFpl5v5K
 +BU2aPvR5CPLtUQz9MqrVF5/6zwDbHGN+GIB6CEmh/qHIVQAhhS4XR+opSc7qqUr
 fAQ1AhL5Oh8Dyn9DRy5Io8oRv+wo5lRdD7aG7SPiizCMRQ34JwJ2sWIAwbP2Ea7W
 Xg51v3LWEu+UpqpgY3YWBoJKHj4hXwFvTVOCHs94239Y2zlcg2c4WwbKPzkvPcV/
 TvvZOOctty+l3FOB2bqFj3VnvywQmNv8/OixKjSprxlR7nuQlhKaLTWCtRjFbND4
 J/rk2ls5Bl79dnMvyVfV5ghpmGYBf5kkXCP716YsQkRCZUfNVrTOPJrNHZtYilAb
 opRo2UjAyTWxZBvyssnCorHJZUdxlYeIuSTpaG0zUbR0Y6p/7qd31F5k41GbBCFf
 B0lV3IaiVnXk23S2jFVHGhrzoKdFqu30tY7LMaO4xyGVMigOZJyBu8TZ7Utj9HmW
 /4GfjlvYqlfB7p+6yBkDv/87hjdmfMWIw48A7xWCiIeguQhB79gwTV7uAHVtgfng
 h5RF2EH/fx5klbAZx9vlaAh3pGFBHbh9fkeBmW9qNm7glz7aMUuxQaSo6X8HrCAJ
 LrECgDGbuiOHnMYuzZRERZiqwLB7JT82C1xopGzefsE/i0kN1eMjITkfggjQ5whu
 caLPn49tAb9U8P6TsPeE
 =PF+t
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Ext4 bug fixes.

  We also reserved code points for encryption and read-only images (for
  which the implementation is mostly just the reserved code point for a
  read-only feature :-)"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix indirect punch hole corruption
  ext4: ignore journal checksum on remount; don't fail
  ext4: remove duplicate remount check for JOURNAL_CHECKSUM change
  ext4: fix mmap data corruption in nodelalloc mode when blocksize < pagesize
  ext4: support read-only images
  ext4: change to use setup_timer() instead of init_timer()
  ext4: reserve codepoints used by the ext4 encryption feature
  jbd2: complain about descriptor block checksum errors
2015-02-22 18:05:13 -08:00
Linus Torvalds
038911597e Merge branch 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull lazytime mount option support from Al Viro:
 "Lazytime stuff from tytso"

* 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4: add optimization for the lazytime mount option
  vfs: add find_inode_nowait() function
  vfs: add support for a lazytime mount option
2015-02-17 16:12:34 -08:00
Ross Zwisler
923ae0ff92 ext4: add DAX functionality
This is a port of the DAX functionality found in the current version of
ext2.

[matthew.r.wilcox@intel.com: heavily tweaked]
[akpm@linux-foundation.org: remap_pages went away]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:04 -08:00
Eric Sandeen
2d5b86e048 ext4: ignore journal checksum on remount; don't fail
As of v3.18, ext4 started rejecting a remount which changes the
journal_checksum option.

Prior to that, it was simply ignored; the problem here is that
if someone has this in their fstab for the root fs, now the box
fails to boot properly, because remount of root with the new options
will fail, and the box proceeds with a readonly root.

I think it is a little nicer behavior to accept the option, but
warn that it's being ignored, rather than failing the mount,
but that might be a subjective matter...

Reported-by: Cónräd <conradsand.arma@gmail.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12 23:07:37 -05:00
Eric Sandeen
b94a8b36be ext4: remove duplicate remount check for JOURNAL_CHECKSUM change
rejection of, changing journal_checksum during remount.  One suffices.

While we're at it, remove old comment about the "check" option
which has been deprecated for some time now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12 23:04:27 -05:00
Darrick J. Wong
2cb5cc8b09 ext4: support read-only images
Add a rocompat feature, "readonly" to mark a FS image as read-only.
The feature prevents the kernel and e2fsprogs from changing the image;
the flag can be toggled by tune2fs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12 22:31:21 -05:00
Linus Torvalds
6bec003528 Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block
Pull backing device changes from Jens Axboe:
 "This contains a cleanup of how the backing device is handled, in
  preparation for a rework of the life time rules.  In this part, the
  most important change is to split the unrelated nommu mmap flags from
  it, but also removing a backing_dev_info pointer from the
  address_space (and inode), and a cleanup of other various minor bits.

  Christoph did all the work here, I just fixed an oops with pages that
  have a swap backing.  Arnd fixed a missing export, and Oleg killed the
  lustre backing_dev_info from staging.  Last patch was from Al,
  unexporting parts that are now no longer needed outside"

* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
  Make super_blocks and sb_lock static
  mtd: export new mtd_mmap_capabilities
  fs: make inode_to_bdi() handle NULL inode
  staging/lustre/llite: get rid of backing_dev_info
  fs: remove default_backing_dev_info
  fs: don't reassign dirty inodes to default_backing_dev_info
  nfs: don't call bdi_unregister
  ceph: remove call to bdi_unregister
  fs: remove mapping->backing_dev_info
  fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
  nilfs2: set up s_bdi like the generic mount_bdev code
  block_dev: get bdev inode bdi directly from the block device
  block_dev: only write bdev inode on close
  fs: introduce f_op->mmap_capabilities for nommu mmap support
  fs: kill BDI_CAP_SWAP_BACKED
  fs: deduplicate noop_backing_dev_info
2015-02-12 13:50:21 -08:00
Theodore Ts'o
a26f49926d ext4: add optimization for the lazytime mount option
Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-05 02:45:00 -05:00
Jan Kara
1fa5efe362 ext4: Use generic helpers for quotaon and quotaoff
Ext4 can just use the generic helpers provided by quota code for turning
quotas on and off when quota files are stored as system inodes. The only
difference is the feature test in ext4_quota_on_sysfile() but the same
is achieved in dquot_quota_enable() by checking whether usage tracking
for the corresponding quota type is enabled (which can happen only if
quota feature is set).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-30 12:50:42 +01:00
Jan Mrazek
04ecddb73d ext4: change to use setup_timer() instead of init_timer()
Signed-off-by: Jan Mrazek <email@honzamrazek.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-26 14:42:31 -05:00
Christoph Hellwig
de1414a654 fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
Now that we got rid of the bdi abuse on character devices we can always use
sb->s_bdi to get at the backing_dev_info for a file, except for the block
device special case.  Export inode_to_bdi and replace uses of
mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:04 -07:00
Linus Torvalds
3b421b80be Revert a potential seek_data/hole regression which shows up when using
ext4 to handle ext3 file systems, plus two minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUqxTQAAoJENNvdpvBGATwUC8QAIYfP02XYyGrBQIoAoCiRJji
 TmhAe2Amy9xHgBq2I8Yz3bi8j1wGuiqc57fYgwwVScf10tzmO/HuwnAZZDAmg9hK
 ZWp9WyPoyf+U/nbkIfC5mRh3Qz0dt1pt6R3uQDUlcUuAamdMBrdJnhkQC6WMbpU2
 fAqsJT3/yGrLnMF29eVqJzcxb5KORJ8hEcD7kwkvJwe4sGm3C7iDsjS0i63YWDz4
 QNclW6zF4THhmuVNxwRupOgMQNSq8sHg8U23nP4DZLvLE7GlgtwfDvehU7uBfw5n
 WO5UfsEYLoeODNmujUJCtjXNLpzDXmrtByyWbbTK7EX3MmV94ym4uu5lHLfyMiTc
 o2ppxcsKBVcOsPWnFwuhJ5p/Wyy0Uld9Q3P6b5ymhyzDhkuwcTURpeRxBRXHEgcm
 nY5GE1bBdO7OigDz/+DFL/Zgr8EO7hW72hrBaLDWMEbrrl0asZw/ReC/bnreMmm4
 sP87DB+MqRXzRs8aOPWmCofJwGSgCYmOq2nqNCAaxgk/ofvrDURrnZYfLrbzspGa
 hqE1W0X5hvQydcifi4qq2Na76+Js3atSY38EOH/HNknSqlQjysnkW4ajTDWk/GFy
 M/fKUCfIl1tmCMN2myZzl89E7uMSyod75ycd0BQy36iHPE14JVvk/u7GfcKHLs53
 1rAPpW90a72GX2Z9+xxA
 =uPYz
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Revert a potential seek_data/hole regression which shows up when using
  ext4 to handle ext3 file systems, plus two minor bug fixes"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove spurious KERN_INFO from ext4_warning call
  Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"
  ext4: prevent online resize with backup superblock
2015-01-06 14:05:40 -08:00
Jakub Wilk
363307e6e5 ext4: remove spurious KERN_INFO from ext4_warning call
Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-02 15:31:14 -05:00
Linus Torvalds
9bfccec24e Lots of bugs fixes, including Zheng and Jan's extent status shrinker
fixes, which should improve CPU utilization and potential soft lockups
 under heavy memory pressure, and Eric Whitney's bigalloc fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUiRUwAAoJENNvdpvBGATwltQP/3sjHtFw+RUvKgQ8vX9M2THk
 4b9j0ja0mrD3ObTXUxdDuOh1q09MsfSUiOYK6KZOav3nO/dRODqZnWgXz/zJt3LC
 R97s4velgzZi3F2ijnLiCo5RVZahN9xs8bUHZ85orMIr5wogwGdaUpnoqZSg0Ehr
 PIFnTNORyNXBwEm3XPjUmENTdyq9FZ8DsS6ACFzgFi79QTSyJFEM4LAl2XaqwMGV
 fVhNwnOGIyT8lHZAtDcobkaC86NjakmpW2Ip3p9/UEQtynh16UeVXKEO3K7CcQ+L
 YJRDNnSIlGpR1OJp+v6QJPUd8q4fc/8JW9AxxsLak0eqkszuB+MxoQXOCFV5AWaf
 jrs4TV3y0hCuB4OwuYUpnfcU1o+O7p39MqXMv8SA1ZBPbijN/LQSMErFtXj2oih6
 3gJHUWLwELGeR+d9JlI29zxhOeOIotX255UBgj2oasQ0X3BW3qAgQ4LmP3QY90Pm
 BUmxiMoIWB9N3kU4XQGf+Kyy8JeMLJj0frHDxI3XLz+B+IlWCCkBH6y3AD/a13kS
 HHMMLOwHGEs0lYEKsm89dkcij5GuKd8eKT8Q0+CvKD9Z6HPdYvQxoazmF87Q6j/7
 ZmshaVxtWaLpNbDaXVg+IgZifJAN0+mVzVHRhY9TSjx8k9qLdSgSEqYWjkSjx9Ij
 nNB2zVrHZDMvZ7MCZy85
 =ZrTc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Lots of bugs fixes, including Zheng and Jan's extent status shrinker
  fixes, which should improve CPU utilization and potential soft lockups
  under heavy memory pressure, and Eric Whitney's bigalloc fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
  ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
  ext4: fix suboptimal seek_{data,hole} extents traversial
  ext4: ext4_inline_data_fiemap should respect callers argument
  ext4: prevent fsreentrance deadlock for inline_data
  ext4: forbid journal_async_commit in data=ordered mode
  jbd2: remove unnecessary NULL check before iput()
  ext4: Remove an unnecessary check for NULL before iput()
  ext4: remove unneeded code in ext4_unlink
  ext4: don't count external journal blocks as overhead
  ext4: remove never taken branch from ext4_ext_shift_path_extents()
  ext4: create nojournal_checksum mount option
  ext4: update comments regarding ext4_delete_inode()
  ext4: cleanup GFP flags inside resize path
  ext4: introduce aging to extent status tree
  ext4: cleanup flag definitions for extent status tree
  ext4: limit number of scanned extents in status tree shrinker
  ext4: move handling of list of shrinkable inodes into extent status code
  ext4: change LRU to round-robin in extent status tree shrinker
  ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  ext4: fix block reservation for bigalloc filesystems
  ...
2014-12-12 09:28:03 -08:00
Jan Kara
d4f7610743 ext4: forbid journal_async_commit in data=ordered mode
Option journal_async_commit breaks gurantees of data=ordered mode as it
sends only a single cache flush after writing a transaction commit
block. Thus even though the transaction including the commit block is
fully stored on persistent storage, file data may still linger in drives
caches and will be lost on power failure. Since all checksums match on
journal recovery, we replay the transaction thus possibly exposing stale
user data.

To fix this data exposure issue, remove the possibility to use
journal_async_commit in data=ordered mode.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:19:17 -05:00
Eric Sandeen
b003b52496 ext4: don't count external journal blocks as overhead
This was fixed for ext3 with:

e6d8fb3 ext3: Count internal journal as bsddf overhead in ext3_statfs

but was never fixed for ext4.

With a large external journal and no used disk blocks, df comes
out negative without this, as journal blocks are added to the
overhead & subtracted from used blocks unconditionally.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:27:44 -05:00
Darrick J. Wong
c6d3d56dd0 ext4: create nojournal_checksum mount option
Create a mount option to disable journal checksumming (because the
metadata_csum feature turns it on by default now), and fix remount not
to allow changing the journal checksumming option, since changing the
mount options has no effect on the journal.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:20:50 -05:00
Dmitry Monakhov
4fdb554318 ext4: cleanup GFP flags inside resize path
We must use GFP_NOFS instead GFP_KERNEL inside ext4_mb_add_groupinfo
and ext4_calculate_overhead() because they are called from inside a
journal transaction. Call trace:

ioctl
 ->ext4_group_add
   ->journal_start
   ->ext4_setup_new_descs
     ->ext4_mb_add_groupinfo -> GFP_KERNEL
   ->ext4_flex_group_add
     ->ext4_update_super
       ->ext4_calculate_overhead  -> GFP_KERNEL
   ->journal_stop

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 13:08:04 -05:00
Jan Kara
dd47592551 ext4: limit number of scanned extents in status tree shrinker
Currently we scan extent status trees of inodes until we reclaim nr_to_scan
extents. This can however require a lot of scanning when there are lots
of delayed extents (as those cannot be reclaimed).

Change shrinker to work as shrinkers are supposed to and *scan* only
nr_to_scan extents regardless of how many extents did we actually
reclaim. We however need to be careful and avoid scanning each status
tree from the beginning - that could lead to a situation where we would
not be able to reclaim anything at all when first nr_to_scan extents in
the tree are always unreclaimable. We remember with each inode offset
where we stopped scanning and continue from there when we next come
across the inode.

Note that we also need to update places calling __es_shrink() manually
to pass reasonable nr_to_scan to have a chance of reclaiming anything and
not just 1.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:51:23 -05:00
Jan Kara
b0dea4c165 ext4: move handling of list of shrinkable inodes into extent status code
Currently callers adding extents to extent status tree were responsible
for adding the inode to the list of inodes with freeable extents. This
is error prone and puts list handling in unnecessarily many places.

Just add inode to the list automatically when the first non-delay extent
is added to the tree and remove inode from the list when the last
non-delay extent is removed.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:49:25 -05:00
Zheng Liu
edaa53cac8 ext4: change LRU to round-robin in extent status tree shrinker
In this commit we discard the lru algorithm for inodes with extent
status tree because it takes significant effort to maintain a lru list
in extent status tree shrinker and the shrinker can take a long time to
scan this lru list in order to reclaim some objects.

We replace the lru ordering with a simple round-robin.  After that we
never need to keep a lru list.  That means that the list needn't be
sorted if the shrinker can not reclaim any objects in the first round.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:45:37 -05:00
Al Viro
b93b41d4c7 ext4: kill ext4_kvfree()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-20 12:19:11 -05:00
Jan Kara
96c7e0d964 ext4: Convert to private i_dquot field
CC: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 10:06:11 +01:00
Darrick J. Wong
50460fe8c6 ext4: remove extent status procfs files if journal load fails
If we can't load the journal, remove the procfs files for the extent
status information file to avoid leaking resources.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-30 10:53:16 -04:00
Darrick J. Wong
6b992ff256 ext4: disallow changing journal_csum option during remount
ext4 does not permit changing the metadata or journal checksum feature
flag while mounted.  Until we decide to support that, don't allow a
remount to change the journal_csum flag (right now we silently fail to
change anything).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30 10:53:16 -04:00
Darrick J. Wong
98c1a7593f ext4: enable journal checksum when metadata checksum feature enabled
If metadata checksumming is turned on for the FS, we need to tell the
journal to use checksumming too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-30 10:53:16 -04:00
Linus Torvalds
c2661b8060 A large number of cleanups and bug fixes, with some (minor) journal
optimizations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUPlLCAAoJENNvdpvBGATwpN8P/jnbDL1RqM9ZEAWfbDhvYumR
 Fi59b3IDzSJHuuJeP0nTblVbbWclpO9ljCd18ttsHr8gBXA0ViaEU0XvWbpHIwPN
 1fr1/Ovd0wvBdIVdLlaLXTR9skH4lbkiXxv/tkfjVCOSpzqiKID98Z72e/gUjB7Z
 8xjAn/mTCnXKnhqMGzi8RC2MP1wgY//ErR21bj6so/8RC8zu4P6JuVj/hI6s0y5i
 IPtAmjhdM7nxnS0wJwj7dLT0yNDftDh69qE6CgIwyK+Xn/SZFgYwE6+l02dj3DET
 ZcAzTT9ToTMJdWtMu+5Y4LY8ObJ5xqMPbMoUclQ3DWe6nZicvtcBVCjfG/J8pFlY
 IFD0nfh/OpX9cQMwJ+5Y8P4TrMiqM+FfuLfu+X83gLyrAyIazwoaZls2lxlEyC0w
 M25oAqeKGUeVakVlmDZlVyBf05cu5m62x1rRvpcwMXMNhJl8/xwsSdhdYGeJfbO0
 0MfL1n6GmvHvouMXKNsXlat/w3QVaQWVRzqdF9x7Q730fSHC/zxVGO+Po3jz2fBd
 fBdfE14BIIU7nkyBVy0CZG5SDmQW4YACocOv/ATmII9j76F9eZQ3zsA8J1x+dLmJ
 dP1Uxvsn1C3HW8Ua239j0XUJncglb06iEId0ywdkmWcc1rbzsyZ/NzXN/QBdZmqB
 9g4GKAXAyh15PeBTJ5K/
 =vWic
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A large number of cleanups and bug fixes, with some (minor) journal
  optimizations"

[ This got sent to me before -rc1, but was stuck in my spam folder.   - Linus ]

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits)
  ext4: check s_chksum_driver when looking for bg csum presence
  ext4: move error report out of atomic context in ext4_init_block_bitmap()
  ext4: Replace open coded mdata csum feature to helper function
  ext4: delete useless comments about ext4_move_extents
  ext4: fix reservation overflow in ext4_da_write_begin
  ext4: add ext4_iget_normal() which is to be used for dir tree lookups
  ext4: don't orphan or truncate the boot loader inode
  ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
  ext4: optimize block allocation on grow indepth
  ext4: get rid of code duplication
  ext4: fix over-defensive complaint after journal abort
  ext4: fix return value of ext4_do_update_inode
  ext4: fix mmap data corruption when blocksize < pagesize
  vfs: fix data corruption when blocksize < pagesize for mmaped data
  ext4: fold ext4_nojournal_sops into ext4_sops
  ext4: support freezing ext2 (nojournal) file systems
  ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
  ext4: don't check quota format when there are no quota files
  jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list
  jbd2: avoid pointless scanning of checkpoint lists
  ...
2014-10-20 09:50:11 -07:00
Darrick J. Wong
813d32f913 ext4: check s_chksum_driver when looking for bg csum presence
Convert the ext4_has_group_desc_csum predicate to look for a checksum
driver instead of the metadata_csum flag and change the bg checksum
calculation function to look for GDT_CSUM before taking the crc16
path.

Without this patch, if we mount with ^uninit_bg,^metadata_csum and
later metadata_csum gets turned on by accident, the block group
checksum functions will incorrectly assume that checksumming is
enabled (metadata_csum) but that crc16 should be used
(!s_chksum_driver).  This is totally wrong, so fix the predicate
and the checksum formula selection.

(Granted, if the metadata_csum feature bit gets enabled on a live FS
then something underhanded is going on, but we could at least avoid
writing garbage into the on-disk fields.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
2014-10-14 02:35:49 -04:00
Dmitry Monakhov
9aa5d32ba2 ext4: Replace open coded mdata csum feature to helper function
Besides the fact that this replacement improves code readability
it also protects from errors caused direct EXT4_S(sb)->s_es manipulation
which may result attempt to use uninitialized  csum machinery.

#Testcase_BEGIN
IMG=/dev/ram0
MNT=/mnt
mkfs.ext4 $IMG
mount $IMG $MNT
#Enable feature directly on disk, on mounted fs
tune2fs -O metadata_csum  $IMG
# Provoke metadata update, likey result in OOPS
touch $MNT/test
umount $MNT
#Testcase_END

# Replacement script
@@
expression E;
@@
- EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
+ ext4_has_metadata_csum(E)

https://bugzilla.kernel.org/show_bug.cgi?id=82201

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-13 03:36:16 -04:00
Theodore Ts'o
f4bb298102 ext4: add ext4_iget_normal() which is to be used for dir tree lookups
If there is a corrupted file system which has directory entries that
point at reserved, metadata inodes, prohibit them from being used by
treating them the same way we treat Boot Loader inodes --- that is,
mark them to be bad inodes.  This prohibits them from being opened,
deleted, or modified via chmod, chown, utimes, etc.

In particular, this prevents a corrupted file system which has a
directory entry which points at the journal inode from being deleted
and its blocks released, after which point Much Hilarity Ensues.

Reported-by: Sami Liedes <sami.liedes@iki.fi>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-05 22:56:00 -04:00
Tejun Heo
d06efebf0c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall.  The
commit reverted and patches to implement proper fix will be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
2014-09-24 13:00:21 -04:00
Theodore Ts'o
f6e63f9080 ext4: fold ext4_nojournal_sops into ext4_sops
There's no longer any need to have a separate set of super_operations
for nojournal mode.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 17:12:30 -04:00
Theodore Ts'o
bb04457658 ext4: support freezing ext2 (nojournal) file systems
Through an oversight, when we added nojournal support to ext4, we
didn't add support to allow file system freezing.  This is relatively
easy to add, so let's do it.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Dexuan Cui <decui@microsoft.com>
2014-09-18 17:12:02 -04:00
Theodore Ts'o
bda3253043 ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
This allows us to eliminate duplicate code, and eventually allow us to
also fold ext4_sops and ext4_nojournal_sops together.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 16:12:37 -04:00
Jan Kara
279bf6d390 ext4: don't check quota format when there are no quota files
The check whether quota format is set even though there are no
quota files with journalled quota is pointless and it actually
makes it impossible to turn off journalled quotas (as there's
no way to unset journalled quota format). Just remove the check.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 01:12:15 -04:00
Dmitry Monakhov
844749764b ext4: explicitly inform user about orphan list cleanup
Production fs likely compiled/mounted w/o jbd debugging, so orphan
list clearing will be silent.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-16 14:52:03 -04:00
Darrick J. Wong
df4763bea5 ext4: validate external journal superblock checksum
If the external journal device has metadata_csum enabled, verify
that the superblock checksum matches the block before we try to
mount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:44:36 -04:00
Darrick J. Wong
feb8c6d3dd jbd2: fix journal checksum feature flag handling
Clear all three journal checksum feature flags before turning on
whichever journal checksum options we want.  Rearrange the error
checking so that newer flags get complained about first.

Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:38:21 -04:00
Lukas Czerner
c7f725435a ext4: provide separate operations for sysfs feature files
Currently sysfs feature files uses ext4_attr_ops as the file operations
to show/store data. However the feature files is not supposed to contain
any data at all, the sole existence of the file means that the module
support the feature. Moreover, none of the sysfs feature attributes
actually register show/store functions so that would not be a problem.

However if a sysfs feature attribute register a show or store function
we might be in trouble because the kobject in this case is _not_ embedded
in the ext4_sb_info structure as ext4_attr_show/store expect.

So just to be safe, provide separate empty sysfs_ops to use in
ext4_feat_ktype. This might safe us from potential problems in the
future. As a bonus we can "store" something more descriptive than
nothing in the files, so let it contain "enabled" to make it clear that
the feature is really present in the module.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:27:58 -04:00
Lukas Czerner
52c198c682 ext4: add sysfs entry showing whether the fs contains errors
Currently there is no easy way to tell that the mounted file system
contains errors other than checking for log messages, or reading the
information directly from superblock.

This patch adds new sysfs entries:

errors_count		(number of fs errors we encounter)
first_error_time	(unix timestamp for the first error we see)
last_error_time		(unix timestamp for the last error we see)

If the file system is not marked as containing errors then any of the
file will return 0. Otherwise it will contain valid information. More
details about the errors should as always be found in the logs.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:18:13 -04:00
Jan Kara
a2d4a646e6 ext4: don't use MAXQUOTAS value
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types ext4 supports. Although
ext4 will support project quotas, use ext4 private definition for
consistency with other filesystems.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:15:15 -04:00
Tejun Heo
908c7f1949 percpu_counter: add @gfp to percpu_counter_init()
Percpu allocator now supports allocation mask.  Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion.  This is the one with
the most users.  Other percpu data structures are a lot easier to
convert.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
2014-09-08 09:51:29 +09:00
Gioh Kim
a8ac900b81 ext4: use non-movable memory for the ext4 superblock
Since the ext4 superblock is not released until the file system is
unmounted, allocate the buffer cache entry for the ext4 superblock out
of the non-moveable are to allow page migrations and thus CMA
allocations to more easily succeed if the CMA area is limited.

Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 22:36:15 -04:00
Zheng Liu
eb68d0e2fc ext4: track extent status tree shrinker delay statictics
This commit adds some statictics in extent status tree shrinker.  The
purpose to add these is that we want to collect more details when we
encounter a stall caused by extent status tree shrinker.  Here we count
the following statictics:
  stats:
    the number of all objects on all extent status trees
    the number of reclaimable objects on lru list
    cache hits/misses
    the last sorted interval
    the number of inodes on lru list
  average:
    scan time for shrinking some objects
    the number of shrunk objects
  maximum:
    the inode that has max nr. of objects on lru list
    the maximum scan time for shrinking some objects

The output looks like below:
  $ cat /proc/fs/ext4/sda1/es_shrinker_info
  stats:
    28228 objects
    6341 reclaimable objects
    5281/631 cache hits/misses
    586 ms last sorted interval
    250 inodes on lru list
  average:
    153 us scan time
    128 shrunk objects
  maximum:
    255 inode (255 objects, 198 reclaimable)
    125723 us max scan time

If the lru list has never been sorted, the following line will not be
printed:
    586ms last sorted interval
If there is an empty lru list, the following lines also will not be
printed:
    250 inodes on lru list
  ...
  maximum:
    255 inode (255 objects, 198 reclaimable)
    0 us max scan time

Meanwhile in this commit a new trace point is defined to print some
details in __ext4_es_shrink().

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 22:26:49 -04:00
Darrick J. Wong
45f1a9c3f6 ext4: enable block_validity by default
Enable by default the block_validity feature, which checks for
collisions between newly allocated blocks and critical system
metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 21:34:09 -04:00
Theodore Ts'o
1c2150283c ext4: convert ext4_bread() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:52:15 -04:00
Darrick J. Wong
db9ee22036 jbd2: fix descriptor block size handling errors with journal_csum
It turns out that there are some serious problems with the on-disk
format of journal checksum v2.  The foremost is that the function to
calculate descriptor tag size returns sizes that are too big.  This
causes alignment issues on some architectures and is compounded by the
fact that some parts of jbd2 use the structure size (incorrectly) to
determine the presence of a 64bit journal instead of checking the
feature flags.

Therefore, introduce journal checksum v3, which enlarges the
descriptor block tag format to allow for full 32-bit checksums of
journal blocks, fix the journal tag function to return the correct
sizes, and fix the jbd2 recovery code to use feature flags to
determine 64bitness.

Add a few function helpers so we don't have to open-code quite so
many pieces.

Switching to a 16-byte block size was found to increase journal size
overhead by a maximum of 0.1%, to convert a 32-bit journal with no
checksumming to a 32-bit journal with checksum v3 enabled.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-28 22:22:29 -04:00
Theodore Ts'o
d5e03cbb0c ext4: rearrange initialization to fix EXT4FS_DEBUG
The EXT4FS_DEBUG is a *very* developer specific #ifdef designed for
ext4 developers only.  (You have to modify fs/ext4/ext4.h to enable
it.)

Rearrange how we initialize data structures to avoid calling
ext4_count_free_clusters() until the multiblock allocator has been
initialized.

This also allows us to only call ext4_count_free_clusters() once, and
simplifies the code somewhat.

(Thanks to Chen Gang <gang.chen.5i5j@gmail.com> for pointing out a
!CONFIG_SMP compile breakage in the original patch.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-15 06:01:38 -04:00
Theodore Ts'o
f9ae9cf5d7 ext4: revert commit which was causing fs corruption after journal replays
Commit 007649375f ("ext4: initialize multi-block allocator before
checking block descriptors") causes the block group descriptor's count
of the number of free blocks to become inconsistent with the number of
free blocks in the allocation bitmap.  This is a harmless form of fs
corruption, but it causes the kernel to potentially remount the file
system read-only, or to panic, depending on the file systems's error
behavior.

Thanks to Eric Whitney for his tireless work to reproduce and to find
the guilty commit.

Fixes: 007649375f ("ext4: initialize multi-block allocator before checking block descriptors"

Cc: stable@vger.kernel.org  # 3.15
Reported-by: David Jander <david@protonic.nl>
Reported-by: Matteo Croce <technoboy85@gmail.com>
Tested-by: Eric Whitney <enwlinux@gmail.com>
Suggested-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-11 13:55:40 -04:00
Eric Sandeen
5dd214248f ext4: disable synchronous transaction batching if max_batch_time==0
The mount manpage says of the max_batch_time option,

	This optimization can be turned off entirely
	by setting max_batch_time to 0.

But the code doesn't do that.  So fix the code to do
that.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 19:18:22 -04:00
Theodore Ts'o
ae0f78de2c ext4: clarify error count warning messages
Make it clear that values printed are times, and that it is error
since last fsck. Also add note about fsck version required.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2014-07-05 18:40:52 -04:00
liang xie
5d60125530 ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
Make them more consistently

Signed-off-by: xieliang <xieliang@xiaomi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 22:06:43 -04:00
Lukas Czerner
c8b459f492 ext4: remove unnecessary double parentheses
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 12:55:07 -04:00
Stephen Hemminger
c197855ea1 ext4: make local functions static
I have been running make namespacecheck to look for unneeded globals, and
found these in ext4.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:50:23 -04:00
Darrick J. Wong
bd63f6b0cd ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
On a filesystem with a 1k block size, the group descriptors live in
block 2, not block 1.  If the filesystem has bigalloc,meta_bg set,
however, the calculation of the group descriptor table location does
not take this into account and returns the wrong block number.  Fix
the calculation to return the correct value for this case.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:06:27 -04:00
Theodore Ts'o
202ee5df38 ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
To avoid potential data races, use a spinlock which protects the raw
(on-disk) inode.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21 14:37:55 -04:00
Lukas Czerner
556615dcbf ext4: rename uninitialized extents to unwritten
Currently in ext4 there is quite a mess when it comes to naming
unwritten extents. Sometimes we call it uninitialized and sometimes we
refer to it as unwritten.

The right name for the extent which has been allocated but does not
contain any written data is _unwritten_. Other file systems are
using this name consistently, even the buffer head state refers to it as
unwritten. We need to fix this confusion in ext4.

This commit changes every reference to an uninitialized extent (meaning
allocated but unwritten) to unwritten extent. This includes comments,
function names and variable names. It even covers abbreviation of the
word uninitialized (such as uninit) and some misspellings.

This commit does not change any of the code paths at all. This has been
confirmed by comparing md5sums of the assembly code of each object file
after all the function names were stripped from it.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-20 23:45:47 -04:00
Azat Khuzhin
007649375f ext4: initialize multi-block allocator before checking block descriptors
With EXT4FS_DEBUG ext4_count_free_clusters() will call
ext4_read_block_bitmap() without s_group_info initialized, so we need to
initialize multi-block allocator before.

And dependencies that must be solved, to allow this:
- multi-block allocator needs in group descriptors
- need to install s_op before initializing multi-block allocator,
  because in ext4_mb_init_backend() new inode is created.
- initialize number of group desc blocks (s_gdb_count) otherwise
  number of clusters returned by ext4_free_clusters_after_init() is not correct.
  (see ext4_bg_num_gdb_nometa())

Here is the stack backtrace:

(gdb) bt
 #0  ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430
 #1  ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000,
     desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0,
     bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358
 #2  0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000,
     block_group=block_group@entry=0,
     bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476
 #3  0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000,
     block_group=block_group@entry=0) at balloc.c:489
 #4  0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665
 #5  0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>,
     sb=0xffff880079a10000) at super.c:2143
 #6  ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>,
     data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851
     ...

Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-07 10:54:20 -04:00
Theodore Ts'o
ed3654eb98 ext4: optimize Hurd tests when reading/writing inodes
Set a in-memory superblock flag to indicate whether the file system is
designed to support the Hurd.

Also, add a sanity check to make sure the 64-bit feature is not set
for Hurd file systems, since i_file_acl_high conflicts with a
Hurd-specific field.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-24 14:09:06 -04:00
T Makphaibulchoke
9c191f701c ext4: each filesystem creates and uses its own mb_cache
This patch adds new interfaces to create and destory cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and remove
the cache creation and destory calls from ex4_init_xattr() and
ext4_exitxattr() in fs/ext4/xattr.c.

fs/ext4/super.c has been changed so that when a filesystem is mounted
a cache is allocated and attched to its ext4_sb_info structure.

fs/mbcache.c has been changed so that only one slab allocator is
allocated and used by all mbcache structures.

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
2014-03-18 19:24:49 -04:00
Theodore Ts'o
38c03b3439 ext4: only call sync_filesystm() when remounting read-only
This is the only time it is required for ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-13 22:49:42 -04:00
Theodore Ts'o
02b9984d64 fs: push sync_filesystem() down to the file system's remount_fs()
Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly
documented or guaraunteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Anders Larsen <al@alarsen.net>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org
2014-03-13 10:14:33 -04:00
Fabian Frederick
e67bc2b359 ext4: Add __init marking to init_inodecache
init_inodecache is only called by __init init_ext4_fs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-17 20:34:53 -05:00
Theodore Ts'o
2330141097 ext4: don't try to modify s_flags if the the file system is read-only
If an ext4 file system is created by some tool other than mke2fs
(perhaps by someone who has a pathalogical fear of the GPL) that
doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags,
and that file system is then mounted read-only, don't try to modify
the s_flags field.  Otherwise, if dm_verity is in use, the superblock
will change, causing an dm_verity failure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-02-12 12:16:04 -05:00
Jan Kara
30fac0f75d ext4: Do not reserve clusters when fs doesn't support extents
When the filesystem doesn't support extents (like in ext2/3
compatibility modes), there is no need to reserve any clusters. Space
estimates for writing are exact, hole punching doesn't need new
metadata, and there are no unwritten extents to convert.

This fixes a problem when filesystem still having some free space when
accessed with a native ext2/3 driver suddently reports ENOSPC when
accessed with ext4 driver.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-08 21:11:59 -05:00
Al Viro
9105bb149b ext4: fix del_timer() misuse for ->s_err_report
That thing should be del_timer_sync(); consider what happens
if ext4_put_super() call of del_timer() happens to come just as it's
getting run on another CPU.  Since that timer reschedules itself
to run next day, you are pretty much guaranteed that you'll end up
with kfree'd scheduled timer, with usual fun consequences.  AFAICS,
that's -stable fodder all way back to 2010... [the second del_timer_sync()
is almost certainly not needed, but it doesn't hurt either]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-08 20:52:31 -05:00
Theodore Ts'o
dd1f723bf5 ext4: use prandom_u32() instead of get_random_bytes()
Many of the uses of get_random_bytes() do not actually need
cryptographically secure random numbers.  Replace those uses with a
call to prandom_u32(), which is faster and which doesn't consume
entropy from the /dev/random driver.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-11-08 00:14:53 -05:00
Theodore Ts'o
efbed4dc58 ext4: add ratelimiting to ext4 messages
In the case of a storage device that suddenly disappears, or in the
case of significant file system corruption, this can result in a huge
flood of messages being sent to the console.  This can overflow the
file system containing /var/log/messages, or if a serial console is
configured, this can slow down the system so much that a hardware
watchdog can end up triggering forcing a system reboot.

Google-Bug-Id: 7258357

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-10-17 21:11:01 -04:00
Linus Torvalds
2e515bf096 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "The usual trivial updates all over the tree -- mostly typo fixes and
  documentation updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
  doc: Documentation/cputopology.txt fix typo
  treewide: Convert retrun typos to return
  Fix comment typo for init_cma_reserved_pageblock
  Documentation/trace: Correcting and extending tracepoint documentation
  mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
  power: Documentation: Update s2ram link
  doc: fix a typo in Documentation/00-INDEX
  Documentation/printk-formats.txt: No casts needed for u64/s64
  doc: Fix typo "is is" in Documentations
  treewide: Fix printks with 0x%#
  zram: doc fixes
  Documentation/kmemcheck: update kmemcheck documentation
  doc: documentation/hwspinlock.txt fix typo
  PM / Hibernate: add section for resume options
  doc: filesystems : Fix typo in Documentations/filesystems
  scsi/megaraid fixed several typos in comments
  ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
  treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
  page_isolation: Fix a comment typo in test_pages_isolated()
  doc: fix a typo about irq affinity
  ...
2013-09-06 09:36:28 -07:00
Linus Torvalds
45d9a2220f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 1 from Al Viro:
 "Unfortunately, this merge window it'll have a be a lot of small piles -
  my fault, actually, for not keeping #for-next in anything that would
  resemble a sane shape ;-/

  This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
  cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
  components) + several long-standing patches from various folks.

  There definitely will be a lot more (starting with Miklos'
  check_submount_and_drop() series)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  direct-io: Handle O_(D)SYNC AIO
  direct-io: Implement generic deferred AIO completions
  add formats for dentry/file pathnames
  kvm eventfd: switch to fdget
  powerpc kvm: use fdget
  switch fchmod() to fdget
  switch epoll_ctl() to fdget
  switch copy_module_from_fd() to fdget
  git simplify nilfs check for busy subtree
  ibmasmfs: don't bother passing superblock when not needed
  don't pass superblock to hypfs_{mkdir,create*}
  don't pass superblock to hypfs_diag_create_files
  don't pass superblock to hypfs_vm_create_files()
  oprofile: get rid of pointless forward declarations of struct super_block
  oprofilefs_create_...() do not need superblock argument
  oprofilefs_mkdir() doesn't need superblock argument
  don't bother with passing superblock to oprofile_create_stats_files()
  oprofile: don't bother with passing superblock to ->create_files()
  don't bother passing sb to oprofile_create_files()
  coh901318: don't open-code simple_read_from_buffer()
  ...
2013-09-05 08:50:26 -07:00
Christoph Hellwig
7b7a8665ed direct-io: Implement generic deferred AIO completions
Add support to the core direct-io code to defer AIO completions to user
context using a workqueue.  This replaces opencoded and less efficient
code in XFS and ext4 (we save a memory allocation for each direct IO)
and will be needed to properly support O_(D)SYNC for AIO.

The communication between the filesystem and the direct I/O code requires
a new buffer head flag, which is a bit ugly but not avoidable until the
direct I/O code stops abusing the buffer_head structure for communicating
with the filesystems.

Currently this creates a per-superblock unbound workqueue for these
completions, which is taken from an earlier patch by Jan Kara.  I'm
not really convinced about this use and would prefer a "normal" global
workqueue with a high concurrency limit, but this needs further discussion.

JK: Fixed ext4 part, dynamic allocation of the workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-04 09:23:46 -04:00
Eric Sandeen
ad4eec6135 ext4: allow specifying external journal by pathname mount option
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.

The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.

Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do i.e.

# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...

and it'll mount even if the devnum has changed, as shown here:

# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0 
# mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1

Change the journal device number:

# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile 

And today it will fail:

# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

# dmesg | tail -n 1
[17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal

But with this new mount option, we can specify the new path:

# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#

(which does update the encoded device number, incidentally):

# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device:	          0x0701

But best of all we can just always mount by journal-path, and
it'll always work:

# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#

So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-08-28 19:05:07 -04:00
Joe Perches
8be04b9374 treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.

Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20 13:06:40 +02:00
Piotr Sarna
6ae6514b33 ext4: fix mount/remount error messages for incompatible mount options
Commit 5688978 ("ext4: improve handling of conflicting mount options")
introduced incorrect messages shown while choosing wrong mount options.

First of all, both cases of incorrect mount options,
"data=journal,delalloc" and "data=journal,dioread_nolock" result in
the same error message.

Secondly, the problem above isn't solved for remount option: the
mismatched parameter is simply ignored.  Moreover, ext4_msg states
that remount with options "data=journal,delalloc" succeeded, which is
not true.

To fix it up, I added a simple check after parse_options() call to
ensure that data=journal and delalloc/dioread_nolock parameters are
not present at the same time.

Signed-off-by: Piotr Sarna <p.sarna@partner.samsung.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-08 23:02:24 -04:00
Theodore Ts'o
59d9fa5c2e ext4: allow the mount options nodelalloc and data=journal
Commit 26092bf ("ext4: use a table-driven handler for mount options")
wrongly disallows the specifying the mount options nodelalloc and
data=journal simultaneously.  This is incorrect; it should have only
disallowed the combination of delalloc and data=journal
simultaneously.

Reported-by: Piotr Sarna <p.sarna@partner.samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-08 23:01:24 -04:00
Eric Sandeen
dd12ed144e ext4: destroy ext4_es_cachep on module unload
Without this, module can't be reloaded.

[  500.521980] kmem_cache_sanity_check (ext4_extent_status): Cache name already exists.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org  # v3.8+
2013-07-26 15:21:11 -04:00
Theodore Ts'o
ad065dd016 ext4: don't show usrquota/grpquota twice in /proc/mounts
We now print mount options in a generic fashion in
ext4_show_options(), so we shouldn't be explicitly printing the
{usr,grp}quota options in ext4_show_quota_options().

Without this patch, /proc/mounts can look like this:

 /dev/vdb /vdb ext4 rw,relatime,quota,usrquota,data=ordered,usrquota 0 0
                                      ^^^^^^^^              ^^^^^^^^

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-11 18:54:37 -04:00
Theodore Ts'o
960fd856fd ext4: fix ext4_get_group_number()
The function ext4_get_group_number() was introduced as an optimization
in commit bd86298e60.  Unfortunately, this commit incorrectly
calculate the group number for file systems with a 1k block size (when
s_first_data_block is 1 instead of zero).  This could cause the
following kernel BUG:

[  568.877799] ------------[ cut here ]------------
[  568.877833] kernel BUG at fs/ext4/mballoc.c:3728!
[  568.877840] Oops: Exception in kernel mode, sig: 5 [#1]
[  568.877845] SMP NR_CPUS=32 NUMA pSeries
[  568.877852] Modules linked in: binfmt_misc
[  568.877861] CPU: 1 PID: 3516 Comm: fs_mark Not tainted 3.10.0-03216-g7c6809f-dirty #1
[  568.877867] task: c0000001fb0b8000 ti: c0000001fa954000 task.ti: c0000001fa954000
[  568.877873] NIP: c0000000002f42a4 LR: c0000000002f4274 CTR: c000000000317ef8
[  568.877879] REGS: c0000001fa956ed0 TRAP: 0700   Not tainted  (3.10.0-03216-g7c6809f-dirty)
[  568.877884] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 24000428  XER: 00000000
[  568.877902] SOFTE: 1
[  568.877905] CFAR: c0000000002b5464
[  568.877908]
GPR00: 0000000000000001 c0000001fa957150 c000000000c6a408 c0000001fb588000
GPR04: 0000000000003fff c0000001fa9571c0 c0000001fa9571c4 000138098c50625f
GPR08: 1301200000000000 0000000000000002 0000000000000001 0000000000000000
GPR12: 0000000024000422 c00000000f33a300 0000000000008000 c0000001fa9577f0
GPR16: c0000001fb7d0100 c000000000c29190 c0000000007f46e8 c000000000a14672
GPR20: 0000000000000001 0000000000000008 ffffffffffffffff 0000000000000000
GPR24: 0000000000000100 c0000001fa957278 c0000001fdb2bc78 c0000001fa957288
GPR28: 0000000000100100 c0000001fa957288 c0000001fb588000 c0000001fdb2bd10
[  568.877993] NIP [c0000000002f42a4] .ext4_mb_release_group_pa+0xec/0x1c0
[  568.877999] LR [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0
[  568.878004] Call Trace:
[  568.878008] [c0000001fa957150] [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0 (unreliable)
[  568.878017] [c0000001fa957200] [c0000000002fb070] .ext4_mb_discard_lg_preallocations+0x394/0x444
[  568.878025] [c0000001fa957340] [c0000000002fb45c] .ext4_mb_release_context+0x33c/0x734
[  568.878032] [c0000001fa957440] [c0000000002fbcf8] .ext4_mb_new_blocks+0x4a4/0x5f4
[  568.878039] [c0000001fa957510] [c0000000002ef56c] .ext4_ext_map_blocks+0xc28/0x1178
[  568.878047] [c0000001fa957640] [c0000000002c1a94] .ext4_map_blocks+0x2c8/0x490
[  568.878054] [c0000001fa957730] [c0000000002c536c] .ext4_writepages+0x738/0xc60
[  568.878062] [c0000001fa957950] [c000000000168a78] .do_writepages+0x5c/0x80
[  568.878069] [c0000001fa9579d0] [c00000000015d1c4] .__filemap_fdatawrite_range+0x88/0xb0
[  568.878078] [c0000001fa957aa0] [c00000000015d23c] .filemap_write_and_wait_range+0x50/0xfc
[  568.878085] [c0000001fa957b30] [c0000000002b8edc] .ext4_sync_file+0x220/0x3c4
[  568.878092] [c0000001fa957be0] [c0000000001f849c] .vfs_fsync_range+0x64/0x80
[  568.878098] [c0000001fa957c70] [c0000000001f84f0] .vfs_fsync+0x38/0x4c
[  568.878105] [c0000001fa957d00] [c0000000001f87f4] .do_fsync+0x54/0x90
[  568.878111] [c0000001fa957db0] [c0000000001f8894] .SyS_fsync+0x28/0x3c
[  568.878120] [c0000001fa957e30] [c000000000009c88] syscall_exit+0x0/0x7c
[  568.878125] Instruction dump:
[  568.878130] 60000000 813d0034 81610070 38000000 7f8b4800 419e001c 813f007c 7d2bfe70
[  568.878144] 7d604a78 7c005850 54000ffe 7c0007b4 <0b000000> e8a10076 e87f0090 7fa4eb78
[  568.878160] ---[ end trace 594d911d9654770b ]---

In addition fix the STD_GROUP optimization so that it works for
bigalloc file systems as well.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Li Zhong <lizhongfs@gmail.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org  # 3.10
2013-07-05 23:11:16 -04:00
Joe Perches
e7c96e8e47 ext4: reduce object size when !CONFIG_PRINTK
Reduce the object size ~10% could be useful for embedded systems.

Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
arguments, passing " " to functions when !CONFIG_PRINTK and still
verifying format and arguments with no_printk.

$ size fs/ext4/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
 264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old

    $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
    # CONFIG_PRINTK is not set
    CONFIG_EXT4_FS=y
    CONFIG_EXT4_USE_FOR_EXT23=y
    CONFIG_EXT4_FS_POSIX_ACL=y
    # CONFIG_EXT4_FS_SECURITY is not set
    # CONFIG_EXT4_DEBUG is not set

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:37 -04:00
Zheng Liu
d3922a777f ext4: improve extent cache shrink mechanism to avoid to burn CPU time
Now we maintain an proper in-order LRU list in ext4 to reclaim entries
from extent status tree when we are under heavy memory pressure.  For
keeping this order, a spin lock is used to protect this list.  But this
lock burns a lot of CPU time.  We can use the following steps to trigger
it.

  % cd /dev/shm
  % dd if=/dev/zero of=ext4-img bs=1M count=2k
  % mkfs.ext4 ext4-img
  % mount -t ext4 -o loop ext4-img /mnt
  % cd /mnt
  % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
  % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
  % perf record -a -g
  % perf report

This commit tries to fix this problem.  Now a new member called
i_touch_when is added into ext4_inode_info to record the last access
time for an inode.  Meanwhile we never need to keep a proper in-order
LRU list.  So this can avoid to burns some CPU time.  When we try to
reclaim some entries from extent status tree, we use list_sort() to get
a proper in-order list.  Then we traverse this list to discard some
entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
time of sorting this list.  When we traverse the list, we skip the inode
that is newer than this time, and move this inode to the tail of LRU
list.  When the head of the list is newer than s_es_last_sorted, we will
sort the LRU list again.

In this commit, we break the loop if s_extent_cache_cnt == 0 because
that means that all extents in extent status tree have been reclaimed.

Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
changed to save a local variable in these functions.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:37 -04:00
Jon Ernst
03b40e3496 ext4: delete unused variables
This patch removed several unused variables.

Signed-off-by: Jon Ernst <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-17 08:56:26 -04:00
Dmitry Monakhov
4418e14112 ext4: Fix fsync error handling after filesystem abort
If filesystem was aborted after inode's write back is complete
but before its metadata was updated we may return success
results in data loss.
In order to handle fs abort correctly we have to check
fs state once we discover that it is in MS_RDONLY state

Test case: http://patchwork.ozlabs.org/patch/244297

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:38:04 -04:00
Dmitry Monakhov
06a407f13d ext4: fix data integrity for ext4_sync_fs
Inode's data or non journaled quota may be written w/o jounral so we
_must_ send a barrier at the end of ext4_sync_fs. But it can be
skipped if journal commit will do it for us.

Also fix data integrity for nojournal mode.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:25:07 -04:00
Jan Kara
2e8fa54e3b ext4: split extent conversion lists to reserved & unreserved parts
Now that we have extent conversions with reserved transaction, we have
to prevent extent conversions without reserved transaction (from DIO
code) to block these (as that would effectively void any transaction
reservation we did).  So split lists, work items, and work queues to
reserved and unreserved parts.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:21:02 -04:00
Jan Kara
f2d50a65c9 ext4: deprecate max_writeback_mb_bump sysfs attribute
This attribute is now unused so deprecate it.  We still show the old
default value to keep some compatibility but we don't allow writing to
that attribute anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:51:16 -04:00
Paul Taysom
566370a2e5 ext4: suppress ext4 orphan messages on mount
Suppress the messages releating to processing the ext4 orphan list
("truncating inode" and "deleting unreferenced inode") unless the
debug option is on, since otherwise they end up taking up space in the
log that could be used for more useful information.

Tested by opening several files, unlinking them, then
crashing the system, rebooting the system and examining
/var/log/messages.

Addresses the problem described in http://crbug.com/220976

Signed-off-by: Paul Taysom <taysom@chromium.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-05-28 07:51:21 -04:00
Lukas Czerner
55f252c9f5 ext4: truncate_inode_pages() in orphan cleanup path
Currently we do not tell mm to zero out tail of the page before truncate
in orphan_cleanup(). This is ok, because the page should not be
uptodate, however this may eventually change and I might cause problems.

Call truncate_inode_pages() as precautionary measure. Thanks Jan Kara
for pointing this out.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Al Viro
4385bab128 make blkdev_put() return void
same story as with the previous patches - note that return
value of blkdev_close() is lost, since there's nowhere the
caller (__fput()) could return it to.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-07 02:16:31 -04:00
Linus Torvalds
20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
Theodore Ts'o
9f203507ed ext4: mark all metadata I/O with REQ_META
As Dave Chinner pointed out at the 2013 LSF/MM workshop, it's
important that metadata I/O requests are marked as such to avoid
priority inversions caused by I/O bandwidth throttling.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-20 15:46:17 -04:00
Lukas Czerner
e1091b157c ext4: Use kstrtoul() instead of parse_strtoul()
In parse_strtoul() we're still using deprecated simple_strtoul().  Remove
parse_strtoul() altogether and replace it with kstrtoul()

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-11 23:37:19 -04:00
Theodore Ts'o
d6a771056b ext4: fix miscellaneous big endian warnings
None of these result in any bug, but they makes sparse complain.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 23:59:55 -04:00
Lukas Czerner
27dd438542 ext4: introduce reserved space
Currently in ENOSPC condition when writing into unwritten space, or
punching a hole, we might need to split the extent and grow extent tree.
However since we can not allocate any new metadata blocks we'll have to
zero out unwritten part of extent or punched out part of extent, or in
the worst case return ENOSPC even though use actually does not allocate
any space.

Also in delalloc path we do reserve metadata and data blocks for the
time we're going to write out, however metadata block reservation is
very tricky especially since we expect that logical connectivity implies
physical connectivity, however that might not be the case and hence we
might end up allocating more metadata blocks than previously reserved.
So in future, metadata reservation checks should be removed since we can
not assure that we do not under reserve.

And this is where reserved space comes into the picture. When mounting
the file system we slice off a little bit of the file system space (2%
or 4096 clusters, whichever is smaller) which can be then used for the
cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.

The number of reserved clusters can be set via sysfs, however it can
never be bigger than number of free clusters in the file system.

Note that this patch fixes the failure of xfstest 274 as expected.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-04-09 22:11:22 -04:00
Al Viro
d9dda78bad procfs: new helper - PDE_DATA(inode)
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:32 -04:00
Jan Kara
bcb1385096 ext4: fix deadlock with quota feature
We didn't mark hidden quota files with S_NOQUOTA flag and thus quota was
accounted even for quota files. Thus we could recurse back to quota code
when adding new blocks to quota file which can easily deadlock. Mark
hidden quota files properly.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 09:21:41 -04:00
Lukas Czerner
689110098c ext4: make ext4_block_in_group() much more efficient
Currently in when getting the block group number for a particular
block in ext4_block_in_group() we're using
ext4_get_group_no_and_offset() which uses do_div() to get the block
group and the remainer which is offset within the group.

We don't need all of that in ext4_block_in_group() as we only need to
figure out the group number.

This commit changes ext4_block_in_group() to calculate group number
directly. This shows as a big improvement with regards to cpu
utilization. Measuring fallocate -l 15T on fresh file system with perf
showed that 23% of cpu time was spend in the
ext4_get_group_no_and_offset(). With this change it completely
disappears from the list only bumping the occurrence of
ext4_init_block_bitmap() which is the biggest user of
ext4_block_in_group() by 4%. As the result of this change on my system
the fallocate call was approx. 10% faster.

However since there is '-g' option in mkfs which allow us setting
different groups size (mostly for developers) I've introduced new per
file system flag whether we have a standard block group size or
not. The flag is used to determine whether we can use the bit shift
optimization or not.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 22:12:52 -04:00
Dmitry Monakhov
a75ae78f08 ext4: unregister es_shrinker if mount failed
Otherwise destroyed ext_sb_info will be part of global shinker list
and result in the following OOPS:

JBD2: corrupted journal superblock
JBD2: recovery failed
EXT4-fs (dm-2): error loading journal
general protection fault: 0000 [#1] SMP
Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
mod
CPU 1
Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136                  /DH55TC
RIP: 0010:[<ffffffff811bfb2d>]  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP: 0000:ffff88011d5cbcd8  EFLAGS: 00010207
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848
R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000
FS:  00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80)
Stack:
ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
Call Trace:
[<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
[<ffffffff81249381>] mount_bdev+0x331/0x340
[<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
[<ffffffff81362035>] ext4_mount+0x15/0x20
[<ffffffff8124869a>] mount_fs+0x9a/0x2e0
[<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
[<ffffffff81279c02>] do_new_mount+0x172/0x2e0
[<ffffffff8127aa56>] do_mount+0x376/0x380
[<ffffffff8127ab98>] sys_mount+0x138/0x150
[<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b
Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\
8
RIP  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP <ffff88011d5cbcd8>

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-04-03 22:10:52 -04:00
Dmitry Monakhov
5d3ee20855 ext4: fix journal callback list traversal
It is incorrect to use list_for_each_entry_safe() for journal callback
traversial because ->next may be removed by other task:
->ext4_mb_free_metadata()
  ->ext4_mb_free_metadata()
    ->ext4_journal_callback_del()

This results in the following issue:

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
 [<ffffffff810ac6be>] kthread+0x10e/0x120
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70

This patch fix the issue as follows:
- ext4_journal_commit_callback() make list truly traversial safe
  simply by always starting from list_head
- fix race between two ext4_journal_callback_del() and
  ext4_journal_callback_try_del()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.com
2013-04-03 22:08:52 -04:00
Linus Torvalds
d3c926264a Fix a number of regression and other bugs in ext4, most of which were
relatively obscure cornercases or races that were found using
 regression tests.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRSm5lAAoJENNvdpvBGATwZW8QAN7jMn7IaVCTXXblqgqba4uN
 KvLGRgK7R/n1rIhdHoxJHumwRQLTppVzjDCc8ePnWhdypzMZNuzUvs+OoCFdkDsW
 qf3CmL/p/R1oSiSzzFIs/7wGp7xBZ0l0BWZMFWd9EUg9cqoMBDA6KzcMF95fOtas
 KsjRL+BThacVldS7jyKFwE4BrpXd0Z5V9qZ6wjQPPoBx8sXF4iYA+CZVo5FUKBs8
 6I82LS1/PIYCe3IOSpCgyKXQqRzAYJANv1ndken5wW8jWT2R58e360OwZEVcpIN9
 /caov+F5OKfk4iOGq3b+vwRplNhAI2S6C4vhMbmS2GPWE8Fnr8gubyqNAIIs5R/y
 3zYHdqZESfuEF7K3QoAepiJhi3YIoRxXC1FxD7uxx7VBRhW2w8Ij5hlXhuSoh24M
 MUiXgCeIxQb+ZfUx0OHV++LSOHVccU4y7Z0X+LpXQa6tEMBuSgK6yCKsGkyr8APN
 gPMupTptgyUE3tFaCjqc7QKtmoeRAMSvzfqEyV6DlblIOe+3f/RJzRO222Xc4kxq
 D9t2tOuPoXsR+ivtS5pEcrZkE4Y2hkJbJzb7XXvfoETixYsuX6VkiPK/D68S9eRe
 VelqTM2lHPJi/3Wkle0p4pzWpEq70D8qZVp4TKLHMJCTQKpwUfopm5lvln87lc7w
 4JDORIx/ed1u8MMTJlmG
 =X3vc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a number of regression and other bugs in ext4, most of which were
  relatively obscure cornercases or races that were found using
  regression tests."

* tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  ext4: fix data=journal fast mount/umount hang
  ext4: fix ext4_evict_inode() racing against workqueue processing code
  ext4: fix memory leakage in mext_check_coverage
  ext4: use s_extent_max_zeroout_kb value as number of kb
  ext4: use atomic64_t for the per-flexbg free_clusters count
  jbd2: fix use after free in jbd2_journal_dirty_metadata()
  ext4: reserve metadata block for every delayed write
  ext4: update reserved space after the 'correction'
  ext4: do not use yield()
  ext4: remove unused variable in ext4_free_blocks()
  ext4: fix WARN_ON from ext4_releasepage()
  ext4: fix the wrong number of the allocated blocks in ext4_split_extent()
  ext4: update extent status tree after an extent is zeroed out
  ext4: fix wrong m_len value after unwritten extent conversion
  ext4: add self-testing infrastructure to do a sanity check
  ext4: avoid a potential overflow in ext4_es_can_be_merged()
  ext4: invalidate extent status tree during extent migration
  ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
  ext4: add warning to ext4_convert_unwritten_extents_endio
  ext4: disable merging of uninitialized extents
  ...
2013-03-21 17:56:10 -07:00
Eric W. Biederman
fa7614ddd6 fs: Readd the fs module aliases.
I had assumed that the only use of module aliases for filesystems
prior to "fs: Limit sys_mount to only request filesystem modules."
was in request_module.  It turns out I was wrong.  At least mkinitcpio
in Arch linux uses these aliases.

So readd the preexising aliases, to keep from breaking userspace.

Userspace eventually will have to follow and use the same aliases the
kernel does.  So at some point we may be delete these aliases without
problems.  However that day is not today.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-12 18:55:21 -07:00