According to Ted Ts'o ext4_getblk() called in ext4_xattr_inode_write()
should not return bh = NULL
The only time that bh could be NULL, then, would be in the case of
something really going wrong; a programming error elsewhere (perhaps a
wild pointer dereference) or I/O error causing on-disk file system
corruption (although that would be highly unlikely given that we had
*just* allocated the blocks and so the metadata blocks in question
probably would still be in the cache).
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.13
bs.bh was taken in previous ext4_xattr_block_find() call,
it should be released before re-using
Fixes: 7e01c8e542 ("ext3/4: fix uninitialized bs in ...")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 2.6.26
Extended attribute names are defined to be NUL-terminated, so the name
must not contain a NUL character. This is important because there are
places when remove extended attribute, the code uses strlen to
determine the length of the entry. That should probably be fixed at
some point, but code is currently really messy, so the simplest fix
for now is to simply validate that the extended attributes are sane.
https://bugzilla.kernel.org/show_bug.cgi?id=200401
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Use a separate journal transaction if it turns out that we need to
convert an inline file to use an data block. Otherwise we could end
up failing due to not having journal credits.
This addresses CVE-2018-10883.
https://bugzilla.kernel.org/show_bug.cgi?id=200071
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
When expanding the extra isize space, we must never move the
system.data xattr out of the inode body. For performance reasons, it
doesn't make any sense, and the inline data implementation assumes
that system.data xattr is never in the external xattr block.
This addresses CVE-2018-10880
https://bugzilla.kernel.org/show_bug.cgi?id=200005
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
If there an inode points to a block which is also some other type of
metadata block (such as a block allocation bitmap), the
buffer_verified flag can be set when it was validated as that other
metadata block type; however, it would make a really terrible external
attribute block. The reason why we use the verified flag is to avoid
constantly reverifying the block. However, it doesn't take much
overhead to make sure the magic number of the xattr block is correct,
and this will avoid potential crashes.
This addresses CVE-2018-10879.
https://bugzilla.kernel.org/show_bug.cgi?id=200001
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
In theory this should have been caught earlier when the xattr list was
verified, but in case it got missed, it's simple enough to add check
to make sure we don't overrun the xattr buffer.
This addresses CVE-2018-10879.
https://bugzilla.kernel.org/show_bug.cgi?id=200001
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
Ext4 will always create ext4 extended attributes which do not have a
value (where e_value_size is zero) with e_value_offs set to zero. In
most places e_value_offs will not be used in a substantive way if
e_value_size is zero.
There was one exception to this, which is in ext4_xattr_set_entry(),
where if there is a maliciously crafted file system where there is an
extended attribute with e_value_offs is non-zero and e_value_size is
0, the attempt to remove this xattr will result in a negative value
getting passed to memmove, leading to the following sadness:
[ 41.225365] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 44.538641] BUG: unable to handle kernel paging request at ffff9ec9a3000000
[ 44.538733] IP: __memmove+0x81/0x1a0
[ 44.538755] PGD 1249bd067 P4D 1249bd067 PUD 1249c1067 PMD 80000001230000e1
[ 44.538793] Oops: 0003 [#1] SMP PTI
[ 44.539074] CPU: 0 PID: 1470 Comm: poc Not tainted 4.16.0-rc1+ #1
...
[ 44.539475] Call Trace:
[ 44.539832] ext4_xattr_set_entry+0x9e7/0xf80
...
[ 44.539972] ext4_xattr_block_set+0x212/0xea0
...
[ 44.540041] ext4_xattr_set_handle+0x514/0x610
[ 44.540065] ext4_xattr_set+0x7f/0x120
[ 44.540090] __vfs_removexattr+0x4d/0x60
[ 44.540112] vfs_removexattr+0x75/0xe0
[ 44.540132] removexattr+0x4d/0x80
...
[ 44.540279] path_removexattr+0x91/0xb0
[ 44.540300] SyS_removexattr+0xf/0x20
[ 44.540322] do_syscall_64+0x71/0x120
[ 44.540344] entry_SYSCALL_64_after_hwframe+0x21/0x86
https://bugzilla.kernel.org/show_bug.cgi?id=199347
This addresses CVE-2018-10840.
Reported-by: "Xu, Wen" <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
Fixes: dec214d00e ("ext4: xattr inode deduplication")
Add explicit checks in ext4_xattr_block_get() just in case the
e_value_offs and e_value_size fields in the the xattr block are
corrupted in memory after the buffer_verified bit is set on the xattr
block.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Add some paranoia checks to make sure we don't stray beyond the end of
the valid memory region containing ext4 xattr entries while we are
scanning for a match.
Also rename the function to xattr_find_entry() since it is static and
thus only used in fs/ext4/xattr.c
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Refactor the call to EXT4_ERROR_INODE() into ext4_xattr_check_block().
This simplifies the code, and fixes a problem where not all callers of
ext4_xattr_check_block() were not resulting in ext4_error() getting
called when the xattr block is corrupted.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
ext4 isn't validating the sizes of xattrs where the value of the xattr
is stored in an external inode. This is problematic because
->e_value_size is a u32, but ext4_xattr_get() returns an int. A very
large size is misinterpreted as an error code, which ext4_get_acl()
translates into a bogus ERR_PTR() for which IS_ERR() returns false,
causing a crash.
Fix this by validating that all xattrs are <= INT_MAX bytes.
This issue has been assigned CVE-2018-1095.
https://bugzilla.kernel.org/show_bug.cgi?id=199185https://bugzilla.redhat.com/show_bug.cgi?id=1560793
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Original Lustre ea_inode feature did not have ref counts on xattr inodes
because there was always one parent that referenced it. New
implementation expects ref count to be initialized which is not true for
Lustre case. Handle this by detecting Lustre created xattr inode and set
its ref count to 1.
The quota handling of xattr inodes have also changed with deduplication
support. New implementation manually manages quotas to support sharing
across multiple users. A consequence is that, a referencing inode
incorporates the blocks of xattr inode into its own i_block field.
We need to know how a xattr inode was created so that we can reverse the
block charges during reference removal. This is handled by introducing a
EXT4_STATE_LUSTRE_EA_INODE flag. The flag is set on a xattr inode if
inode appears to have been created by Lustre. During xattr inode reference
removal, the manual quota uncharge is skipped if the flag is set.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When updating an extended attribute, if the padded value sizes are the
same, a shortcut is taken to avoid the bulk of the work. This was fine
until the xattr hash update was moved inside ext4_xattr_set_entry().
With that change, the hash update got missed in the shortcut case.
Thanks to ZhangYi (yizhang089@gmail.com) for root causing the problem.
Fixes: daf8328172 ("ext4: eliminate xattr entry e_hash recalculation for removes")
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Clean up some goto statement, make ext4_expand_extra_isize_ea() clearer.
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
Current ext4_expand_extra_isize just tries to expand extra isize, if
someone is holding xattr lock or some check fails, it will give up.
So rename its name to ext4_try_to_expand_extra_isize.
Besides that, we clean up unnecessary check and move some relative checks
into it.
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
We should avoid the contention between the i_extra_isize update and
the inline data insertion, so move the xattr trylock in front of
i_extra_isize update.
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
ext4_xattr_inode_read() currently reads each block sequentially while
waiting for io operation to complete before moving on to the next
block. This prevents request merging in block layer.
Add a ext4_bread_batch() function that starts reads for all blocks
then optionally waits for them to complete. A similar logic is used
in ext4_find_entry(), so update that code to use the new function.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When an xattr block has a single reference, block is updated inplace
and it is reinserted to the cache. Later, a cache lookup is performed
to see whether an existing block has the same contents. This cache
lookup will most of the time return the just inserted entry so
deduplication is not achieved.
Running the following test script will produce two xattr blocks which
can be observed in "File ACL: " line of debugfs output:
mke2fs -b 1024 -I 128 -F -O extent /dev/sdb 1G
mount /dev/sdb /mnt/sdb
touch /mnt/sdb/{x,y}
setfattr -n user.1 -v aaa /mnt/sdb/x
setfattr -n user.2 -v bbb /mnt/sdb/x
setfattr -n user.1 -v aaa /mnt/sdb/y
setfattr -n user.2 -v bbb /mnt/sdb/y
debugfs -R 'stat x' /dev/sdb | cat
debugfs -R 'stat y' /dev/sdb | cat
This patch defers the reinsertion to the cache so that we can locate
other blocks with the same contents.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
For Lustre, if ea_inode fails in hash validation but passes parent
inode and generation checks, it won't be added to the cache as well
as the error "-EFSCORRUPTED" should be cleared, otherwise it will
cause "Structure needs cleaning" when running getfattr command.
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-9723
Cc: stable@vger.kernel.org
Fixes: dec214d00e
Signed-off-by: Emoly Liu <emoly.liu@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: tahsin@google.com
ea_inode feature allows creating extended attributes that are up to
64k in size. Update __ext4_new_inode() to pick increased credit limits.
To avoid overallocating too many journal credits, update
__ext4_xattr_set_credits() to make a distinction between xattr create
vs update. This helps __ext4_new_inode() because all attributes are
known to be new, so we can save credits that are normally needed to
delete old values.
Also, have fscrypt specify its maximum context size so that we don't
end up allocating credits for 64k size.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The main purpose of mb cache is to achieve deduplication in
extended attributes. In use cases where opportunity for deduplication
is unlikely, it only adds overhead.
Add a mount option to explicitly turn off mb cache.
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
To verify that a xattr entry is not pointing to the wrong xattr inode,
we currently check that the target inode has EXT4_EA_INODE_FL flag set and
also the entry size matches the target inode size.
For stronger validation, also incorporate crc32c hash of the value into
the e_hash field. This is done regardless of whether the entry lives in
the inode body or external attribute block.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When an extended attribute block is modified, ext4_xattr_hash_entry()
recalculates e_hash for the entry that is pointed by s->here. This is
unnecessary if the modification is to remove an entry.
Currently, if the removed entry is the last one and there are other
entries remaining, hash calculation targets the just erased entry which
has been filled with zeroes and effectively does nothing. If the removed
entry is not the last one and there are more entries, this time it will
recalculate hash on the next entry which is totally unnecessary.
Fix these by moving the decision on when to recalculate hash to
ext4_xattr_set_entry().
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
New ea_inode feature allows putting large xattr values into external
inodes. struct ext4_xattr_entry and the attribute name however have to
remain in the inode extra space or external attribute block. Once that
space is exhausted, no further entries can be added. Some of that space
could also be used by values that fit in there at the time of addition.
So, a single xattr entry whose value barely fits in the external block
could prevent further entries being added.
To mitigate the problem, this patch introduces a notion of reserved
space in the external attribute block that cannot be used by value data.
This reserve is enforced when ea_inode feature is enabled. The amount
of reserve is arbitrarily chosen to be min(block_size/8, 1024). The
table below shows how much space is reserved for each block size and the
guaranteed mininum number of entries that can be placed in the external
attribute block.
block size reserved bytes entries (name length = 16)
1k 128 3
2k 256 7
4k 512 15
8k 1024 31
16k 1024 31
32k 1024 31
64k 1024 31
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Ext4 ea_inode feature allows storing xattr values in external inodes to
be able to store values that are bigger than a block in size. Ext4 also
has deduplication support for these type of inodes. With deduplication,
the actual storage waste is eliminated but the users of such inodes are
still charged full quota for the inodes as if there was no sharing
happening in the background.
This design requires ext4 to manually charge the users because the
inodes are shared.
An implication of this is that, if someone calls chown on a file that
has such references we need to transfer the quota for the file and xattr
inodes. Current dquot_transfer() function implicitly transfers one inode
charge. With ea_inode feature, we would like to transfer multiple inode
charges.
Add get_inode_usage callback which can interrogate the total number of
inodes that were charged for a given inode.
[ Applied fix from Colin King to make sure the 'ret' variable is
initialized on the successful return path. Detected by
CoverityScan, CID#1446616 ("Uninitialized scalar variable") --tytso]
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jan Kara <jack@suse.cz>
Ext4 now supports xattr values that are up to 64k in size (vfs limit).
Large xattr values are stored in external inodes each one holding a
single value. Once written the data blocks of these inodes are immutable.
The real world use cases are expected to have a lot of value duplication
such as inherited acls etc. To reduce data duplication on disk, this patch
implements a deduplicator that allows sharing of xattr inodes.
The deduplication is based on an in-memory hash lookup that is a best
effort sharing scheme. When a xattr inode is read from disk (i.e.
getxattr() call), its crc32c hash is added to a hash table. Before
creating a new xattr inode for a value being set, the hash table is
checked to see if an existing inode holds an identical value. If such an
inode is found, the ref count on that inode is incremented. On value
removal the ref count is decremented and if it reaches zero the inode is
deleted.
The quota charging for such inodes is manually managed. Every reference
holder is charged the full size as if there was no sharing happening.
This is consistent with how xattr blocks are also charged.
[ Fixed up journal credits calculation to handle inline data and the
rare case where an shared xattr block can get freed when two thread
race on breaking the xattr block sharing. --tytso ]
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
During inode deletion, the number of journal credits that will be
needed is hard to determine. For that reason we have journal
extend/restart calls in several places. Whenever a transaction is
restarted, filesystem must be in a consistent state because there is
no atomicity guarantee beyond a restart call.
Add ext4_xattr_ensure_credits() helper function which takes care of
journal extend/restart logic. It also handles getting jbd2 write
access and dirty metadata calls. This function is called at every
iteration of handling an ea_inode reference.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There will be a second mb_cache instance that tracks ea_inodes. Make
existing names more explicit so that it is clear that they refer to
xattr block cache.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Make names more generic so that mbcache usage is not limited to
block sharing. In a subsequent patch in the series
("ext4: xattr inode deduplication"), we start using the mbcache code
for sharing xattr inodes. With that patch, old mb_cache_entry.e_block
field could be holding either a block number or an inode number.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tracking struct inode * rather than the inode number eliminates the
repeated ext4_xattr_inode_iget() call later. The second call cannot
fail in practice but still requires explanation when it wants to ignore
the return value. Avoid the trouble and make things simple.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Both ext4_set_acl() and ext4_set_context() need to be made aware of
ea_inode feature when it comes to credits calculation.
Also add a sufficient credits check in ext4_xattr_set_handle() right
after xattr write lock is grabbed. Original credits calculation is done
outside the lock so there is a possiblity that the initially calculated
credits are not sufficient anymore.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In a few places the function returns without trying to pass the actual
error code to the caller. Fix those.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When value size is <= EXT4_XATTR_MIN_LARGE_EA_SIZE(), and it
doesn't fit in either inline or xattr block, a second try is made to
store it in an external inode while storing the entry itself in inline
area. There should also be an attempt to store the entry in xattr block.
This patch adds a retry loop to do that. It also makes the caller the
sole decider on whether to store a value in an external inode.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When there is no space for a value in xattr block, it may be stored
in an xattr inode even if the value length is less than
EXT4_XATTR_MIN_LARGE_EA_SIZE(). So the current assumption in credits
calculation is wrong.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When a xattr entry refers to an external inode, the value data is not
available in the inline area so we should not attempt to read it using
value offset.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When moving xattr entries from inline area to a xattr block, entries
that refer to external xattr inodes need special handling because
value data is not available in the inline area but rather should be
read from its external inode.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_xattr_make_inode_space() is interested in calculating the inline
space used in an inode. When a xattr entry refers to an external inode
the value size indicates the external inode size, not the value size in
the inline area. Change the function to take this into account.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_xattr_value_same() is used as a quick optimization in case the new
xattr value is identical to the previous value. When xattr value is
stored in a xattr inode the check becomes expensive so it is better to
just assume that they are not equal.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Two places in code missed converting xattr inode number using
le32_to_cpu().
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The input and output values of *size parameter are equal on successful
return from ext4_xattr_inode_get(). On error return, the callers ignore
the output value so there is no need to update it.
Also check for NULL return from ext4_bread(). If the actual xattr inode
size happens to be smaller than the expected size, ext4_bread() may
return NULL which would indicate data corruption.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In general, kernel functions indicate success/failure through their return
values. This function returns the status as an output parameter and reserves
the return value for the inode. Make it follow the general convention.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The ref count on ea_inode is incremented by
ext4_xattr_inode_orphan_add() which is supposed to be decremented by
ext4_xattr_inode_array_free(). The decrement is conditioned on whether
the ea_inode is currently on the orphan list. However, the orphan list
addition only happens when journaling is enabled. In non-journaled case,r
we fail to release the ref count causing an error message like below.
"VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.
Have a nice day..."
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Quota charging is based on the ownership of the inode. Currently, the
xattr inode owner is set to the caller which may be different from the
parent inode owner. This is inconsistent with how quota is charged for
xattr block and regular data block writes.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We don't need acls on xattr inodes because they are not directly
accessible from user mode.
Besides lockdep complains about recursive locking of xattr_sem as seen
below.
=============================================
[ INFO: possible recursive locking detected ]
4.11.0-rc8+ #402 Not tainted
---------------------------------------------
python/1894 is trying to acquire lock:
(&ei->xattr_sem){++++..}, at: [<ffffffff804878a6>] ext4_xattr_get+0x66/0x270
but task is already holding lock:
(&ei->xattr_sem){++++..}, at: [<ffffffff80489500>] ext4_xattr_set_handle+0xa0/0x5d0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&ei->xattr_sem);
lock(&ei->xattr_sem);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by python/1894:
#0: (sb_writers#10){.+.+.+}, at: [<ffffffff803d829f>] mnt_want_write+0x1f/0x50
#1: (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803dda27>] vfs_setxattr+0x57/0xb0
#2: (&ei->xattr_sem){++++..}, at: [<ffffffff80489500>] ext4_xattr_set_handle+0xa0/0x5d0
stack backtrace:
CPU: 0 PID: 1894 Comm: python Not tainted 4.11.0-rc8+ #402
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x67/0x99
__lock_acquire+0x5f3/0x1830
lock_acquire+0xb5/0x1d0
down_read+0x2f/0x60
ext4_xattr_get+0x66/0x270
ext4_get_acl+0x43/0x1e0
get_acl+0x72/0xf0
posix_acl_create+0x5e/0x170
ext4_init_acl+0x21/0xc0
__ext4_new_inode+0xffd/0x16b0
ext4_xattr_set_entry+0x5ea/0xb70
ext4_xattr_block_set+0x1b5/0x970
ext4_xattr_set_handle+0x351/0x5d0
ext4_xattr_set+0x124/0x180
ext4_xattr_user_set+0x34/0x40
__vfs_setxattr+0x66/0x80
__vfs_setxattr_noperm+0x69/0x1c0
vfs_setxattr+0xa2/0xb0
setxattr+0x129/0x160
path_setxattr+0x87/0xb0
SyS_setxattr+0xf/0x20
entry_SYSCALL_64_fastpath+0x18/0xad
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Setting a large xattr value may require writing the attribute contents
to an external inode. In this case we may need to lock the xattr inode
along with the parent inode. This doesn't pose a deadlock risk because
xattr inodes are not directly visible to the user and their access is
restricted.
Assign a lockdep subclass to xattr inode's lock.
============================================
WARNING: possible recursive locking detected
4.12.0-rc1+ #740 Not tainted
--------------------------------------------
python/1822 is trying to acquire lock:
(&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff804912ca>] ext4_xattr_set_entry+0x65a/0x7b0
but task is already holding lock:
(&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&sb->s_type->i_mutex_key#15);
lock(&sb->s_type->i_mutex_key#15);
*** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by python/1822:
#0: (sb_writers#10){.+.+.+}, at: [<ffffffff803d0eef>] mnt_want_write+0x1f/0x50
#1: (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffff803d6687>] vfs_setxattr+0x57/0xb0
#2: (jbd2_handle){.+.+..}, at: [<ffffffff80493f40>] start_this_handle+0xf0/0x420
#3: (&ei->xattr_sem){++++..}, at: [<ffffffff804920ba>] ext4_xattr_set_handle+0x9a/0x4f0
stack backtrace:
CPU: 0 PID: 1822 Comm: python Not tainted 4.12.0-rc1+ #740
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x67/0x9e
__lock_acquire+0x5f3/0x1750
lock_acquire+0xb5/0x1d0
down_write+0x2c/0x60
ext4_xattr_set_entry+0x65a/0x7b0
ext4_xattr_block_set+0x1b2/0x9b0
ext4_xattr_set_handle+0x322/0x4f0
ext4_xattr_set+0x144/0x1a0
ext4_xattr_user_set+0x34/0x40
__vfs_setxattr+0x66/0x80
__vfs_setxattr_noperm+0x69/0x1c0
vfs_setxattr+0xa2/0xb0
setxattr+0x12e/0x150
path_setxattr+0x87/0xb0
SyS_setxattr+0xf/0x20
entry_SYSCALL_64_fastpath+0x18/0xad
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Large xattr support is implemented for EXT4_FEATURE_INCOMPAT_EA_INODE.
If the size of an xattr value is larger than will fit in a single
external block, then the xattr value will be saved into the body
of an external xattr inode.
The also helps support a larger number of xattr, since only the headers
will be stored in the in-inode space or the single external block.
The inode is referenced from the xattr header via "e_value_inum",
which was formerly "e_value_block", but that field was never used.
The e_value_size still contains the xattr size so that listing
xattrs does not need to look up the inode if the data is not accessed.
struct ext4_xattr_entry {
__u8 e_name_len; /* length of name */
__u8 e_name_index; /* attribute name index */
__le16 e_value_offs; /* offset in disk block of value */
__le32 e_value_inum; /* inode in which value is stored */
__le32 e_value_size; /* size of attribute value */
__le32 e_hash; /* hash value of name and value */
char e_name[0]; /* attribute name */
};
The xattr inode is marked with the EXT4_EA_INODE_FL flag and also
holds a back-reference to the owning inode in its i_mtime field,
allowing the ext4/e2fsck to verify the correct inode is accessed.
[ Applied fix by Dan Carpenter to avoid freeing an ERR_PTR. ]
Lustre-Jira: https://jira.hpdd.intel.com/browse/LU-80
Lustre-bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=4424
Signed-off-by: Kalpak Shah <kalpak.shah@sun.com>
Signed-off-by: James Simmons <uja.ornl@gmail.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
ext4_xattr_block_set() calls dquot_alloc_block() to charge for an xattr
block when new references are made. However if dquot_initialize() hasn't
been called on an inode, request for charging is effectively ignored
because ext4_inode_info->i_dquot is not initialized yet.
Add dquot_initialize() to call paths that lead to ext4_xattr_block_set().
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
ext4_xattr_check_entry() was redundant with validation of the full xattr
entries list in ext4_xattr_check_entries(), which all callers also did.
ext4_xattr_check_entry() also didn't actually do correct validation;
specifically, it never checked that the value doesn't overlap the xattr
names, nor did it account for padding when checking whether the xattr
value overflows the available space. So remove it to eliminate any
potential confusion.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_xattr_check_names() actually validates both the xattr names and
values, not just the names. So rename it to ext4_xattr_check_entries()
to avoid confusion.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There's no difference between ext4_xattr_list() and ext4_listxattr(), so
merge them together and just have ext4_listxattr(). Some years ago they
took different arguments, but that's no longer the case.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Constify static data in ext4 that is never (intentionally) modified so
that it is placed in .rodata and benefits from memory protection.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We must lock the xattr block before calculating or verifying the
checksum in order to avoid spurious checksum failures.
https://bugzilla.kernel.org/show_bug.cgi?id=193661
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
It was possible for an xattr value to have a very large size, which
would then pass validation on 32-bit architectures due to a pointer
wraparound. Fix this by validating the size in a way which avoids
pointer wraparound.
It was also possible that a value's size would fit in the available
space but its padded size would not. This would cause an out-of-bounds
memory write in ext4_xattr_set_entry when replacing the xattr value.
For example, if an xattr value of unpadded size 253 bytes went until the
very end of the inode or block, then using setxattr(2) to replace this
xattr's value with 256 bytes would cause a write to the 3 bytes past the
end of the inode or buffer, and the new xattr value would be incorrectly
truncated. Fix this by requiring that the padded size fit in the
available space rather than the unpadded size.
This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
With i_extra_isize equal to or close to the available space, it was
possible for us to read past the end of the inode when trying to detect
or validate in-inode xattrs. Fix this by checking for the needed extra
space first.
This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
current_time() will be transitioned to be y2038 safe
along with vfs.
current_time() returns timestamps according to the
granularities set in the super_block.
The granularity check in ext4_current_time() to call
current_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to obtain timestamps
unconditionally, and remove ext4_current_time().
Quota files are assumed to be on the same filesystem.
Hence, use current_time() for these files as well.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Recent commits require line continuing printks to always use
pr_cont or KERN_CONT. Add these markings to a few more printks.
Miscellaneaous:
o Integrate the ea_idebug and ea_bdebug macros to use a single
call to printk(KERN_DEBUG instead of 3 separate printks
o Use the more common varargs macro style
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
It should be cpu_to_le32(), not le32_to_cpu(). No change in behavior.
Found with sparse, and this was the only endianness warning in fs/ext4/.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Move loop to make enough space in the inode from
ext4_expand_extra_isize_ea() into a separate function to make that
function smaller and better readable and also to avoid delaration of
variables inside a loop block.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
'start' variable is completely unused in ext4_expand_extra_isize_ea().
Variable 'first' is used only once in one place. So just remove them.
Variables 'entry' and 'last' are only really used later in the function
inside a loop. Move their declarations there.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Factor out function for moving xattrs from inode into external xattr
block from ext4_expand_extra_isize_ea(). That function is already quite
long and factoring out this rather standalone functionality helps
readability.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We were checking whether computed offsets do not exceed end of block in
ext4_xattr_shift_entries(). However this does not make sense since we
always only decrease offsets. So replace that assertion with a check
whether we really decrease xattrs value offsets.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we don't support xattrs with e_value_block set. We don't allow
them to pass initial xattr check so there's no point for checking for
this later. Since these tests were untested, bugs were creeping in and
not all places which should have checked were checking e_value_block
anyway.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we don't support xattrs with values stored out of line. Check
for that in ext4_xattr_check_names() to make sure we never work with
such xattrs since not all the code counts with that resulting is possible
weird corruption issues.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Conditions checking whether there is enough free space in an xattr block
and when xattr is large enough to make enough space in the inode forgot
to account for the fact that inode need not be completely filled up with
xattrs. Thus we could move unnecessarily many xattrs out of inode or
even falsely claim there is not enough space to expand the inode. We
also forgot to update the amount of free space in xattr block when moving
more xattrs and thus could decide to move too big xattr resulting in
unexpected failure.
Fix these problems by properly updating free space in the inode and
xattr block as we move xattrs. To simplify the math, avoid shifting
xattrs after removing each one xattr and instead just shift xattrs only
once there is enough free space in the inode.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we need to move xattrs into external xattr block, we call
ext4_xattr_block_set() from ext4_expand_extra_isize_ea(). That may end
up calling ext4_mark_inode_dirty() again which will recurse back into
the inode expansion code leading to deadlocks.
Protect from recursion using EXT4_STATE_NO_EXPAND inode flag and move
its management into ext4_expand_extra_isize_ea() since its manipulation
is safe there (due to xattr_sem) from possible races with
ext4_xattr_set_handle() which plays with it as well.
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We did not count with the padding of xattr value when computing desired
shift of xattrs in the inode when expanding i_extra_isize. As a result
we could create unaligned start of inline xattrs. Account for alignment
properly.
CC: stable@vger.kernel.org # 4.4.x-
Signed-off-by: Jan Kara <jack@suse.cz>
When multiple xattrs need to be moved out of inode, we did not properly
recompute total size of xattr headers in the inode and the new header
position. Thus when moving the second and further xattr we asked
ext4_xattr_shift_entries() to move too much and from the wrong place,
resulting in possible xattr value corruption or general memory
corruption.
CC: stable@vger.kernel.org # 4.4.x
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The code in ext4_expand_extra_isize_ea() treated new_extra_isize
argument sometimes as the desired target i_extra_isize and sometimes as
the amount by which we need to grow current i_extra_isize. These happen
to coincide when i_extra_isize is 0 which used to be the common case and
so nobody noticed this until recently when we added i_projid to the
inode and so i_extra_isize now needs to grow from 28 to 32 bytes.
The result of these bugs was that we sometimes unnecessarily decided to
move xattrs out of inode even if there was enough space and we often
ended up corrupting in-inode xattrs because arguments to
ext4_xattr_shift_entries() were just wrong. This could demonstrate
itself as BUG_ON in ext4_xattr_shift_entries() triggering.
Fix the problem by introducing new isize_diff variable and use it where
appropriate.
CC: stable@vger.kernel.org # 4.4.x
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We temporally change checksum fields in buffers of some types of
metadata into '0' for verifying the checksum values. By doing this
without locking the buffer, some metadata's checksums, which are
being committed or written back to the storage, could be damaged.
In our test, several metadata blocks were found with damaged metadata
checksum value during recovery process. When we only verify the
checksum value, we have to avoid modifying checksum fields directly.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
We aren't checking to see if the in-inode extended attribute is
corrupted before we try to expand the inode's extra isize fields.
This can lead to potential crashes caused by the BUG_ON() check in
ext4_xattr_shift_entries().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
To reduce amount of damage caused by single bad block, we limit number
of inodes sharing an xattr block to 1024. Thus there can be more xattr
blocks with the same contents when there are lots of files with the same
extended attributes. These xattr blocks naturally result in hash
collisions and can form long hash chains and we unnecessarily check each
such block only to find out we cannot use it because it is already
shared by too many inodes.
Add a reusable flag to cache entries which is cleared when a cache entry
has reached its maximum refcount. Cache entries which are not marked
reusable are skipped by mb_cache_entry_find_{first,next}. This
significantly speeds up mbcache when there are many same xattr blocks.
For example for xattr-bench with 5 values and each process handling
20000 files, the run for 64 processes is 25x faster with this patch.
Even for 8 processes the speedup is almost 3x. We have also verified
that for situations where there is only one xattr block of each kind,
the patch doesn't have a measurable cost.
[JK: Remove handling of setting the same value since it is not needed
anymore, check for races in e_reusable setting, improve changelog,
add measurements]
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When someone tried to set xattr to the same value (i.e., not changing
anything) we did all the work of removing original xattr, possibly
breaking references to shared xattr block, inserting new xattr, and
merging xattr blocks again. Since this is not so rare operation and it
is relatively cheap for us to detect this case, check for this and
shortcut xattr setting in that case.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since old mbcache code is gone, let's rename new code to mbcache since
number 2 is now meaningless. This is just a mechanical replacement.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The conversion is generally straightforward. The only tricky part is
that xattr block corresponding to found mbcache entry can get freed
before we get buffer lock for that block. So we have to check whether
the entry is still valid after getting buffer lock.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.
Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.
One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.
There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:
-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----
I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...
Change the list operation to only return whether or not an attribute
should be listed. Copying the attribute names into the buffer is moved
to the callers.
Since the result only depends on the dentry and not on the attribute
name, we do not pass the attribute name to list operations.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The xattr_handler operations are currently all passed a file system
specific flags value which the operations can use to disambiguate between
different handlers; some file systems use that to distinguish the xattr
namespace, for example. In some oprations, it would be useful to also have
access to the handler prefix. To allow that, pass a pointer to the handler
to operations instead of the flags value alone.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros. Furthermore, clean out the
places where we open-coded feature tests.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of overloading EIO for CRC errors and corrupt structures,
return the same error codes that XFS returns for the same issues.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
that's the bulk of filesystem drivers dealing with inodes of their own
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In this if statement, the previous condition is useless, the later one
has covered it.
Signed-off-by: Weiyuan <weiyuan.wei@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Remove unused header files and header files which are included in
ext4.h.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Besides the fact that this replacement improves code readability
it also protects from errors caused direct EXT4_S(sb)->s_es manipulation
which may result attempt to use uninitialized csum machinery.
#Testcase_BEGIN
IMG=/dev/ram0
MNT=/mnt
mkfs.ext4 $IMG
mount $IMG $MNT
#Enable feature directly on disk, on mounted fs
tune2fs -O metadata_csum $IMG
# Provoke metadata update, likey result in OOPS
touch $MNT/test
umount $MNT
#Testcase_END
# Replacement script
@@
expression E;
@@
- EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
+ ext4_has_metadata_csum(E)
https://bugzilla.kernel.org/show_bug.cgi?id=82201
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
When loading extended attributes, check each entry's value offset to
make sure it doesn't collide with the entries.
Without this check it is easy to crash the kernel by mounting a
malicious FS containing a file with an EA wherein e_value_offs = 0 and
e_value_size > 0 and then deleting the EA, which corrupts the name
list.
(See the f_ea_value_crash test's FS image in e2fsprogs for an example.)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
The EXT4_STATE_DELALLOC_RESERVED flag was originally implemented
because it was too hard to make sure the mballoc and get_block flags
could be reliably passed down through all of the codepaths that end up
calling ext4_mb_new_blocks().
Since then, we have mb_flags passed down through most of the code
paths, so getting rid of EXT4_STATE_DELALLOC_RESERVED isn't as tricky
as it used to.
This commit plumbs in the last of what is required, and then adds a
WARN_ON check to make sure we haven't missed anything. If this passes
a full regression test run, we can then drop
EXT4_STATE_DELALLOC_RESERVED.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>