For subpage case, grab_extent_buffer() can't really get an extent buffer
just from btrfs_subpage.
We have radix tree lock protecting us from inserting the same eb into
the tree. Thus we don't really need to do the extra hassle, just let
alloc_extent_buffer() handle the existing eb in radix tree.
Now if two ebs are being allocated as the same time, one will fail with
-EEIXST when inserting into the radix tree.
So for grab_extent_buffer(), just always return NULL for subpage case.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage case, we need to allocate additional memory for each
metadata page.
So we need to:
- Allow attach_extent_buffer_page() to return int to indicate allocation
failure
- Allow manually pre-allocate subpage memory for alloc_extent_buffer()
As we don't want to use GFP_ATOMIC under spinlock, we introduce
btrfs_alloc_subpage() and btrfs_free_subpage() functions for this
purpose.
(The simple wrap for btrfs_free_subpage() is for later convert to
kmem_cache. Already internally tested without problem)
- Preallocate btrfs_subpage structure for alloc_extent_buffer()
We don't want to call memory allocation with spinlock held, so
do preallocation before we acquire mapping->private_lock.
- Handle subpage and regular case differently in
attach_extent_buffer_page()
For regular case, no change, just do the usual thing.
For subpage case, allocate new memory or use the preallocated memory.
For future subpage metadata, we will make use of radix tree to grab
extent buffer.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the incoming subpage support, UNMAPPED extent buffer will have
different behavior in btrfs_release_extent_buffer().
This means we need to set UNMAPPED bit early before calling
btrfs_release_extent_buffer().
Currently there is only one caller which relies on
btrfs_release_extent_buffer() in its error path while set UNMAPPED bit
late:
- btrfs_clone_extent_buffer()
Make it subpage compatible by setting the UNMAPPED bit early, since
we're here, also move the UPTODATE bit early.
There is another caller, __alloc_dummy_extent_buffer(), setting
UNMAPPED bit late, but that function clean up the allocated page
manually, thus no need for any modification.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in
__process_pages_contig(), to let the function know to clear page dirty
bit and then set page writeback.
However page writeback and dirty bits are conflicting (at least for
sector size == PAGE_SIZE case), this means these two have to be always
updated together.
This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to
PAGE_START_WRITEBACK.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Comment for processed extent end of range has an unnecessary "in",
remove it.
Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This makes the file W=1 clean and fixes the following warnings:
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'tree' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'offset' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'next_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'prev_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'p_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:414: warning: Function parameter or member 'parent_ret' not described in '__etree_search'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'tree' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start_ret' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'end_ret' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'bits' not described in 'find_contiguous_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'tree' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start_ret' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'end_ret' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'bits' not described in 'find_first_clear_extent_bit'
fs/btrfs/extent_io.c:4187: warning: Function parameter or member 'epd' not described in 'extent_write_cache_pages'
fs/btrfs/extent_io.c:4187: warning: Excess function parameter 'data' description in 'extent_write_cache_pages'
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will extract the code to grab an extent buffer from a page
into a helper, grab_extent_buffer_from_page().
This reduces one indent level, and provides the work place for later
expansion for subapge support.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The refactoring involves the following modifications:
- iosize alignment
In fact we don't really need to manually do alignment at all.
All extent maps should already be aligned, thus basic ASSERT() check
would be enough.
- redundant variables
We have extra variable like blocksize/pg_offset/end.
They are all unnecessary.
@blocksize can be replaced by sectorsize size directly, and it's only
used to verify the em start/size is aligned.
@pg_offset can be easily calculated using @cur and page_offset(page).
@end is just assigned from @page_end and never modified, use
"start + PAGE_SIZE - 1" directly and remove @page_end.
- remove some BUG_ON()s
The BUG_ON()s are for extent map, which we have tree-checker to check
on-disk extent data item and runtime check.
ASSERT() should be enough.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter offset is confusing, it's supposed to be the disk bytenr
of metadata/data. Rename it to disk_bytenr and update the comment.
Also rename each offset passed to submit_extent_page() as @disk_bytenr
so they're consistent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some extent io trees are initialized with NULL private member (e.g.
btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
Dereference of a NULL tree->private as inode pointer will cause panic.
Pass tree->fs_info as it's known to be valid in all cases.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
Fixes: 05912a3c04 ("btrfs: drop extent_io_ops::tree_fs_info callback")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support sectorsize < PAGE_SIZE case, we need to take extra care of
extent buffer accessors.
Since sectorsize is smaller than PAGE_SIZE, one page can contain
multiple tree blocks, we must use eb->start to determine the real offset
to read/write for extent buffer accessors.
This patch introduces two helpers to do this:
- get_eb_page_index()
This is to calculate the index to access extent_buffer::pages.
It's just a simple wrapper around "start >> PAGE_SHIFT".
For sectorsize == PAGE_SIZE case, nothing is changed.
For sectorsize < PAGE_SIZE case, we always get index as 0, and
the existing page shift also works.
- get_eb_offset_in_page()
This is to calculate the offset to access extent_buffer::pages.
This needs to take extent_buffer::start into consideration.
For sectorsize == PAGE_SIZE case, extent_buffer::start is always
aligned to PAGE_SIZE, thus adding extent_buffer::start to
offset_in_page() won't change the result.
For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
us the correct offset to access.
This patch will touch the following parts to cover all extent buffer
accessors:
- BTRFS_SETGET_HEADER_FUNCS()
- read_extent_buffer()
- read_extent_buffer_to_user()
- memcmp_extent_buffer()
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer_full()
- copy_extent_buffer()
- memcpy_extent_buffer()
- memmove_extent_buffer()
- btrfs_get_token_##bits()
- btrfs_get_##bits()
- btrfs_set_token_##bits()
- btrfs_set_##bits()
- generic_bin_search()
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As a preparation for subpage sector size support (allowing filesystem
with sector size smaller than page size to be mounted) if the sector
size is smaller than page size, we don't allow tree block to be read if
it crosses 64K(*) boundary.
The 64K is selected because:
- we are only going to support 64K page size for subpage for now
- 64K is also the maximum supported node size
This ensures that tree blocks are always contained in one page for a
system with 64K page size, which can greatly simplify the handling.
Otherwise we would have to do complex multi-page handling of tree
blocks. Currently there is no way to create such tree blocks.
In kernel we have avoided such tree blocks allocation even on 4K page
size, as it can lead to RAID56 stripe scrubbing.
While btrfs-progs have fixed its chunk allocator since 2016 for convert,
and has extra checks to do the same behavior as the kernel.
Just add such graceful checks in case of an ancient filesystem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs only support 64K as maximum node size, thus for 4K page system, we
would have at most 16 pages for one extent buffer.
For a system using 64K page size, we would really have just one page.
While we always use 16 pages for extent_buffer::pages, this means for
systems using 64K pages, we are wasting memory for 15 page pointers
which will never be used.
Calculate the array size based on page size and the node size maximum.
- for systems using 4K page size, it will stay 16 pages
- for systems using 64K page size, it will be 1 page
Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
avoid circular inclusion of ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btree_write_cache_pages() we have a btree page submission routine
buried deeply in a nested loop.
This patch will extract that part of code into a helper function,
submit_eb_page(), to do the same work.
Since submit_eb_page() now can return >0 for successful extent
buffer submission, remove the "ASSERT(ret <= 0);" line.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Parameter icsum for check_data_csum() is a little hard to understand.
So is the phy_offset for btrfs_verify_data_csum().
Both parameters are calculated values for csum lookup.
Instead of some calculated value, just pass bio_offset and let the
final and only user, check_data_csum(), calculate whatever it needs.
Since we are here, also make the bio_offset parameter and some related
variables to be u32 (unsigned int).
As bio size is limited by its bi_size, which is unsigned int, and has
extra size limit check during various bio operations.
Thus we are ensured that bio_offset won't overflow u32.
Thus for all involved functions, not only rename the parameter from
@phy_offset to @bio_offset, but also reduce its width to u32, so we
won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 72deb455b5 ("block: remove CONFIG_LBDAF") (5.2) the
sector_t type is u64 on all arches and configs so we don't need to
typecast it. It used to be unsigned long and the result of sector size
shifts were not guaranteed to fit in the type.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In alloc_extent_buffer(), after we got a page from btree inode, we check
if that page has private pointer attached.
If attached, we check if the existing extent buffer has proper refs.
If not (the eb is being freed), we will detach that private eb pointer.
The point here is, we are detaching that eb pointer by calling:
- ClearPagePrivate()
- put_page()
The put_page() here is especially confusing, as it's decreasing the ref
from attach_page_private(). Without knowing that, it looks like the
put_page() is for the find_or_create_page() call, confusing the reader.
Since we're always modifying page private with attach_page_private() and
detach_page_private(), the only open-coded detach_page_private() here is
really confusing.
Fix it by calling detach_page_private().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_invalidatepage() will try to clear all possible bits since it's
calling clear_extent_bit() with delete == 1.
This is currently fine, since for btree io tree, it only utilizes
EXTENT_LOCK bit. But this could be a problem for later subpage support,
which will utilize extra io tree bit to represent additional info.
This patch will just convert that clear_extent_bit() to
unlock_extent_cached().
For current code since only EXTENT_LOCKED bit is utilized, this doesn't
change the behavior, but provides a much cleaner basis for incoming
subpage support.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Parameter @phy_offset is the offset against the bio->bi_iter.bi_sector.
@phy_offset is mostly for data io to lookup the csum in btrfs_io_bio.
But for metadata, it's completely useless as metadata stores their own
csum in its header, so we can remove it.
Note: parameters @start and @end, they are not utilized at all for
current sectorsize == PAGE_SIZE case, as we can grab eb directly from
page.
But those two parameters are very important for later subpage support,
thus @start/@len are not touched here.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the type is unsigned int which could change its width
depending on the architecture. We need up to 32 bits so make it
explicit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new helper to handle update page status in
end_bio_extent_readpage(). This will be later used for subpage support
where the page status update can be more complex than now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In end_bio_extent_readpage() we had a strange dance around
extent_start/extent_len.
Hidden behind the strange dance is, it's just calling
endio_readpage_release_extent() on each bvec range.
Here is an example to explain the original work flow:
Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
end_bio_extent_extent_readpage() entered
|- extent_start = 0;
|- extent_end = 0;
|- bio_for_each_segment_all() {
| |- /* Got the 1st bvec */
| |- start = SZ_1M;
| |- end = SZ_1M + SZ_4K - 1;
| |- update = 1;
| |- if (extent_len == 0) {
| | |- extent_start = start; /* SZ_1M */
| | |- extent_len = end + 1 - start; /* SZ_1M */
| | }
| |
| |- /* Got the 2nd bvec */
| |- start = SZ_1M + 4K;
| |- end = SZ_1M + 4K - 1;
| |- update = 1;
| |- if (extent_start + extent_len == start) {
| | |- extent_len += end + 1 - start; /* SZ_8K */
| | }
| } /* All bio vec iterated */
|
|- if (extent_len) {
|- endio_readpage_release_extent(tree, extent_start, extent_len,
update);
/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
As the above flow shows, the existing code in end_bio_extent_readpage()
is accumulates extent_start/extent_len, and when the contiguous range
stops, calls endio_readpage_release_extent() for the range.
However current behavior has something not really considered:
- The inode can change
For bio, its pages don't need to have contiguous page_offset.
This means, even pages from different inodes can be packed into one
bio.
- bvec cross page boundary
There is a feature called multi-page bvec, where bvec->bv_len can go
beyond bvec->bv_page boundary.
- Poor readability
This patch will address the problem:
- Introduce a proper structure, processed_extent, to record processed
extent range
- Integrate inode/start/end/uptodate check into
endio_readpage_release_extent()
- Add more comment on each step.
This should greatly improve the readability, now in
end_bio_extent_readpage() there are only two
endio_readpage_release_extent() calls.
- Add inode check for contiguity
Now we also ensure the inode is the same one before checking if the
range is contiguous.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are only 2 direct calls to set_extent_bit outside of extent-io -
in btrfs_find_new_delalloc_bytes and btrfs_truncate_block, the rest are
thin wrappers around __set_extent_bit. This adds unnecessary indirection
and just makes it more annoying when looking at the various extent bit
manipulation functions. This patch renames __set_extent_bit to
set_extent_bit effectively removing a level of indirection. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat and remove __must_check ]
Signed-off-by: David Sterba <dsterba@suse.com>
It is unused everywhere now, it can be removed.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that does not match neither what we had before the operation nor
what we get after the operation completes.
In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
The cases where this can happen are the following:
-> Case 1
If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).
This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.
If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).
Example reproducer:
$ cat reproducer-1.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)
# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!
for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done
kill $loop_pid &> /dev/null
wait
umount $DEV
$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 2
If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.
This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.
Example reproducer:
$ cat reproducer-2.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)
# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar
kill $loop_pid &> /dev/null
wait $loop_pid
done
umount $DEV
$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 3
Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.
The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be, does not match the number
of used blocks before or after the clone/deduplication/zero operation.
Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.
Example reproducer:
$ cat reproducer-3.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT
extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))
# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has much less extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
expected=$(stat -c %b $MNT/foo)
# Now deduplicate bar into foo. While the deduplication is in progres,
# the number of used blocks/file size reported by stat should not change
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done
umount $DEV
$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
So fix this by:
1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;
2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;
3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and locking the whole range (0 to (u64)-1) while it
it computes the number of blocks used. But that would mean blocking
stat(2), which is a very used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both Filipe and Fedora QA recently hit the following lockdep splat:
WARNING: possible recursive locking detected
5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1 Not tainted
--------------------------------------------
rsync/2610 is trying to acquire lock:
ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
but task is already holding lock:
ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&eb->lock);
lock(&eb->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by rsync/2610:
#0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
#1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
stack backtrace:
CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x8b/0xb0
__lock_acquire.cold+0x12d/0x2a4
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
lock_acquire+0xc8/0x400
? btrfs_tree_read_lock_atomic+0x34/0x140
? read_block_for_search.isra.0+0xdd/0x320
_raw_read_lock+0x3d/0xa0
? btrfs_tree_read_lock_atomic+0x34/0x140
btrfs_tree_read_lock_atomic+0x34/0x140
btrfs_search_slot+0x616/0x9a0
btrfs_lookup_dir_item+0x6c/0xb0
btrfs_lookup_dentry+0xa8/0x520
? lockdep_init_map_waits+0x4c/0x210
btrfs_lookup+0xe/0x30
__lookup_slow+0x10f/0x1e0
walk_component+0x11b/0x190
path_lookupat+0x72/0x1c0
filename_lookup+0x97/0x180
? strncpy_from_user+0x96/0x1e0
? getname_flags.part.0+0x45/0x1a0
vfs_statx+0x64/0x100
? lockdep_hardirqs_on_prepare+0xff/0x180
? _raw_spin_unlock_irqrestore+0x41/0x50
__do_sys_newlstat+0x26/0x40
? lockdep_hardirqs_on_prepare+0xff/0x180
? syscall_enter_from_user_mode+0x27/0x80
? syscall_enter_from_user_mode+0x27/0x80
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
I have also seen a report of lockdep complaining about the lock class
that was looked up being the same as the lock class on the lock we were
using, but I can't find the report.
These are problems that occur because we do not have the lockdep class
set on the extent buffer until _after_ we read the eb in properly. This
is problematic for concurrent readers, because we will create the extent
buffer, lock it, and then attempt to read the extent buffer.
If a second thread comes in and tries to do a search down the same path
they'll get the above lockdep splat because the class isn't set properly
on the extent buffer.
There was a good reason for this, we generally didn't know the real
owner of the eb until we read it, specifically in refcounted roots.
However now all refcounted roots have the same class name, so we no
longer need to worry about this. For non-refcounted trees we know
which root we're on based on the parent.
Fix this by setting the lockdep class on the eb at creation time instead
of read time. This will fix the splat and the weirdness where the class
changes in the middle of locking the block.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've plumbed all of the callers to have the owner root and the
level, plumb it down into alloc_extent_buffer().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to pass around more information when we allocate extent
buffers, in order to make that cleaner how we do readahead. Most of the
callers have the parent node that we're getting our blockptr from, with
the sole exception of relocation which simply has the bytenr it wants to
read.
Add a helper that takes the current arguments that we need (bytenr and
gen), and add another helper for simply reading the slot out of a node.
In followup patches the helper that takes all the extra arguments will
be expanded, and the simpler helper won't need to have it's arguments
adjusted.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage sector size support, one page can contain multiple tree
blocks. The entries cannot be based on page size and index must be
derived from the sectorsize. No change for page size == sector size.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When calling attach_extent_buffer_page(), either we're attaching
anonymous pages, called from btrfs_clone_extent_buffer(),
or we're attaching btree inode pages, called from alloc_extent_buffer().
For the latter case, we should hold page->mapping->private_lock to avoid
parallel changes to page->private.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer distinguish between blocking and spinning, so rip out all
this code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info value is 32bit, switch also the local u16 variables. This
leads to a better assembly code generated due to movzwl.
This simple change will shave some bytes on x86_64 and release config:
text data bss dec hex filename
1090000 17980 14912 1122892 11224c pre/btrfs.ko
1089794 17980 14912 1122686 11217e post/btrfs.ko
DELTA: -206
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_get_16 shows up in the system performance profiles (helper to read
16bit values from on-disk structures). This is partially because of the
checksum size that's frequently read along with data reads/writes, other
u16 uses are from item size or directory entries.
Replace all calls to btrfs_super_csum_size by the cached value from
fs_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The value of super_block::s_blocksize_bits is the same as
fs_info::sectorsize_bits, but we don't need to do the extra dereferences
in many functions and storing the bits as u32 (in fs_info) generates
shorter assembly.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable @page_size in submit_extent_page() is not related to page
size.
It can already be smaller than PAGE_SIZE, so rename it to io_size to
reduce confusion, this is especially important for later subpage
support.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're reading partial page, btrfs will warn about this as read/write
is always done in sector size, which now equals page size.
But for the upcoming subpage read-only support, our data read is only
aligned to sectorsize, which can be smaller than page size.
Thus here we change the warning condition to check it against
sectorsize, the behavior is not changed for regular sectorsize ==
PAGE_SIZE case, and won't report error for subpage read.
Also, pass the proper start/end with bv_offset for check_data_csum() to
handle.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function process_pages_contig() does not only handle page locking but
also other operations. Rename the local variable pages_locked to
pages_processed to reduce confusion.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @failed_start parameter is only paired with @exclusive_bits, and
those parameters are only used for EXTENT_LOCKED bit, which have their
own wrappers lock_extent_bits().
Thus for regular set_extent_bit() calls, the failed_start makes no
sense, just sink the parameter.
Also, since @failed_start and @exclusive_bits are used in pairs, add
an assert to make it obvious.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The pitfall here is, if the parameter @bits has multiple bits set, we
will return the first range which just has one of the specified bits
set.
This is a little tricky if we want an exact match. Anyway, update the
comment to make that clear.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The return value of that function is completely wrong.
That function only returns 0 if the extent buffer doesn't need to be
submitted. The "ret = 1" and "ret = 0" are determined by the return
value of "test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)".
And if we get ret == 1, it's because the extent buffer is dirty, and we
set its status to EXTENT_BUFFER_WRITE_BACK, and continue to page
locking.
While if we get ret == 0, it means the extent is not dirty from the
beginning, so we don't need to write it back.
The caller also follows this, in btree_write_cache_pages(), if
lock_extent_buffer_for_io() returns 0, we just skip the extent buffer
completely.
So the comment is completely wrong.
Since we're here, also change the description a little. The write bio
flushing won't be visible to the caller, thus it's not an major feature.
In the main description, only describe the locking part to make the
point more clear.
For reference, added in commit 2e3c25136a ("btrfs: extent_io: add
proper error handling to lock_extent_buffer_for_io()")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Historically we've implemented our own locking because we wanted to be
able to selectively spin or sleep based on what we were doing in the
tree. For instance, if all of our nodes were in cache then there's
rarely a reason to need to sleep waiting for node locks, as they'll
likely become available soon. At the time this code was written the
rw_semaphore didn't do adaptive spinning, and thus was orders of
magnitude slower than our home grown locking.
However now the opposite is the case. There are a few problems with how
we implement blocking locks, namely that we use a normal waitqueue and
simply wake everybody up in reverse sleep order. This leads to some
suboptimal performance behavior, and a lot of context switches in highly
contended cases. The rw_semaphores actually do this properly, and also
have adaptive spinning that works relatively well.
The locking code is also a bit of a bear to understand, and we lose the
benefit of lockdep for the most part because the blocking states of the
lock are simply ad-hoc and not mapped into lockdep.
So rework the locking code to drop all of this custom locking stuff, and
simply use a rw_semaphore for everything. This makes the locking much
simpler for everything, as we can now drop a lot of cruft and blocking
transitions. The performance numbers vary depending on the workload,
because generally speaking there doesn't tend to be a lot of contention
on the btree. However, on my test system which is an 80 core single
socket system with 256GiB of RAM and a 2TiB NVMe drive I get the
following results (with all debug options off):
dbench 200 baseline
Throughput 216.056 MB/sec 200 clients 200 procs max_latency=1471.197 ms
dbench 200 with patch
Throughput 737.188 MB/sec 200 clients 200 procs max_latency=714.346 ms
Previously we also used fs_mark to test this sort of contention, and
those results are far less impressive, mostly because there's not enough
tasks to really stress the locking
fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16
baseline
Average Files/sec: 160166.7
p50 Files/sec: 165832
p90 Files/sec: 123886
p99 Files/sec: 123495
real 3m26.527s
user 2m19.223s
sys 48m21.856s
patched
Average Files/sec: 164135.7
p50 Files/sec: 171095
p90 Files/sec: 122889
p99 Files/sec: 113819
real 3m29.660s
user 2m19.990s
sys 44m12.259s
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While using compression, a submitted bio is mapped with a compressed bio
which performs the read from disk, decompresses and returns uncompressed
data to original bio. The original bio must reflect the uncompressed
size (iosize) of the I/O to be performed, or else the page just gets the
decompressed I/O length of data (disk_io_size). The compressed bio
checks the extent map and gets the correct length while performing the
I/O from disk.
This came up in subpage work when only compressed length of the original
bio was filled in the page. This worked correctly for pagesize ==
sectorsize because both compressed and uncompressed data are at pagesize
boundaries, and would end up filling the requested page.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's no longer used just remove the function and any related code which
was initialising it for inodes. No functional changes.
Removing 8 bytes from extent_io_tree in turn reduces size of other
structures where it is embedded, notably btrfs_inode where it reduces
size by 24 bytes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No need to go through a function pointer indirection simply call
submit_bio_hook directly by exporting and renaming the helper to
btrfs_submit_metadata_bio. This makes the code more readable and should
result in somewhat faster code due to no longer paying the price for
specualtive attack mitigations that come with indirect function calls.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead export and rename the function to btrfs_submit_data_bio and
call it directly in submit_one_bio. This avoids paying the cost for
speculative attacks mitigations and improves code readability.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the is_data_inode helper.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS has 2 inode types (for the purposes of the code in submit_one_bio)
- ordinary data inodes (including the freespace inode) and the btree
inode. Both of these implement submit_bio_hook so btrfsic_submit_bio can
never be called from submit_one_bio so just remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't call readpage_end_io_hook for the btree inode. Instead of relying
on indirect calls to implement metadata buffer validation simply check
if the inode whose page we are processing equals the btree inode. If it
does call the necessary function.
This is an improvement in 2 directions:
1. We aren't paying the penalty of indirect calls in a post-speculation
attacks world.
2. The function is now named more explicitly so it's obvious what's
going on
This is in preparation to removing struct extent_io_ops altogether.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This makes reading the code a tad easier by decreasing the level of
indirection by one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0 by the 2 callers so move it inside __do_readpage.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0 by its sole caller - btrfs_readpage. Simply remove
it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's always set to 0 from the sole caller - btrfs_readpage.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_readpage is the only caller of extent_read_full_page the
latter can be open coded in the former. Use the occassion to rename
__extent_read_full_page to extent_read_full_page. To facillitate this
change submit_one_bio has to be exported as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's called only from btrfs_readpage which always passes 0 so just sink
the argument into extent_read_full_page.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that this function is only responsible for reading data pages it's
no longer necessary to pass get_extent_t parameter across several
layers of functions. This patch removes this parameter from multiple
functions: __get_extent_map/__do_readpage/__extent_read_full_page/
extent_read_full_page and simply calls btrfs_get_extent directly in
__get_extent_map.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Metadata pages currently use __do_readpage to read metadata pages,
unfortunately this function is also used to deal with ordinary data
pages. This makes the metadata pages reading code to go through multiple
hoops in order to adhere to __do_readpage invariants. Most of these are
necessary for data pages which could be compressed. For metadata it's
enough to simply build a bio and submit it.
To this effect simply call submit_extent_page directly from
read_extent_buffer_pages which is the only callpath used to populate
extent_buffers with data. This in turn enables further cleanups.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's counterintuitive to have a function named btrfs_inode_xxx which
takes a generic inode. Also move the function to btrfs_inode.h so that
it has access to the definition of struct btrfs_inode.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nested locking with lockdep and everything else refers to lock hierarchy
within the same lock map. This is how we indicate the same locks for
different objects are ok to take in a specific order, for our use case
that would be to take the lock on a leaf and then take a lock on an
adjacent leaf.
What ->lock_nested _actually_ refers to is if we happen to already be
holding the write lock on the extent buffer and we're allowing a read
lock to be taken on that extent buffer, which is recursion. Rename this
so we don't get confused when we switch to a rwsem and have to start
using the _nested helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Still modify memory if the request is obviously wrong
In read_extent_buffer() even we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause memory access error
if start + len overflows.
To address above problems, this patch creates a new common function to
check such access, check_eb_range().
- Add overflow check
This function checks start, start + len against eb->len and overflow
check.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
And also do btrfs_warn() message for non-debug build.
- Exit ASAP if check fails
No more possible memory corruption.
- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
Delete repeated words in fs/btrfs/.
{to, the, a, and old}
and change "into 2 part" to "into 2 parts".
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9NWskACgkQxWXV+ddt
WDsO5A//Xcjo45Th0T0mTiPXWpCPexOmQd3yvEezME5TVavEKM9hkfxYrv5d5i/k
dfFhPg8FYSasW7Sie1PdcP0Hu9mlla5G/4L1pQYMewCdJCSx0mt/dDtEh4GPwcmI
1MgyQ0pnKufNBD9SquO4yGZHR+lP+wY/FTZ6yCePJRvNEvtOWtcxH3vhTrbKC0sc
wSO2xL8Ht3VQBDA6YR+A/OMY8t6k2aN0qOk1BJWcMil0HmNw/rZBH/bWvPOn0dMQ
vaXQIUNRIUUD5qS7vcv5rgVDDWMnyM41eC6idckg1A2tpLGYZyFO3fv2c/VNKFaZ
dWQGeSbg0wRMnQhHhDmAZZwRfTsOhGuSPADnh4qQ07klDto6s5EbmrwkWmGNC6QY
B9yGF09mNwcHvgjz6RVv3mA/dINP32Qg3s90TtO2b7Rrolg1sQLRYDewuTUukmvH
jIg8q1oZBppHS8y5B4Vxl+18Swz4j66Kurq9jhU669n0CUqYBiqg5dEQzDHm54Ca
uEujNWfUGzfiFbj3uVtMG0i9XSHlwEtiGOCkXErlEqAFRSzPQsoFJQ0ae3Hl9FiI
PJuTVmS1UekhCB/ubewf5dOHbEk/nRKbIHnXbuJTQcWWzGtzzyBv4uJH3arQcMrC
QdiBpbqR1BSQ6ki4GbnG9f/4mvnLffsfPDAh+dt9VGxEru/9Q/c=
=cdvW
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two small fixes and a bunch of lockdep fixes for warnings that show up
with an upcoming tree locking update but are valid with current locks
as well"
* tag 'for-5.9-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: fix the error message for transid error
btrfs: set the lockdep class for log tree extent buffers
btrfs: set the correct lockdep class for new nodes
btrfs: allocate scrub workqueues outside of locks
btrfs: fix potential deadlock in the search ioctl
btrfs: drop path before adding new uuid tree entry
btrfs: block-group: fix free-space bitmap threshold
With the conversion of the tree locks to rwsem I got the following
lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
------------------------------------------------------
compsize/11122 is trying to acquire lock:
ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90
but task is already holding lock:
ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-fs-00){++++}-{3:3}:
down_write_nested+0x3b/0x70
__btrfs_tree_lock+0x24/0x120
btrfs_search_slot+0x756/0x990
btrfs_lookup_inode+0x3a/0xb4
__btrfs_update_delayed_inode+0x93/0x270
btrfs_async_run_delayed_root+0x168/0x230
btrfs_work_helper+0xd4/0x570
process_one_work+0x2ad/0x5f0
worker_thread+0x3a/0x3d0
kthread+0x133/0x150
ret_from_fork+0x1f/0x30
-> #1 (&delayed_node->mutex){+.+.}-{3:3}:
__mutex_lock+0x9f/0x930
btrfs_delayed_update_inode+0x50/0x440
btrfs_update_inode+0x8a/0xf0
btrfs_dirty_inode+0x5b/0xd0
touch_atime+0xa1/0xd0
btrfs_file_mmap+0x3f/0x60
mmap_region+0x3a4/0x640
do_mmap+0x376/0x580
vm_mmap_pgoff+0xd5/0x120
ksys_mmap_pgoff+0x193/0x230
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (&mm->mmap_lock#2){++++}-{3:3}:
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
__might_fault+0x68/0x90
_copy_to_user+0x1e/0x80
copy_to_sk.isra.32+0x121/0x300
search_ioctl+0x106/0x200
btrfs_ioctl_tree_search_v2+0x7b/0xf0
btrfs_ioctl+0x106f/0x30a0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
other info that might help us debug this:
Chain exists of:
&mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-fs-00);
lock(&delayed_node->mutex);
lock(btrfs-fs-00);
lock(&mm->mmap_lock#2);
*** DEADLOCK ***
1 lock held by compsize/11122:
#0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
stack backtrace:
CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
dump_stack+0x78/0xa0
check_noncircular+0x165/0x180
__lock_acquire+0x1272/0x2310
lock_acquire+0x9e/0x360
? __might_fault+0x3e/0x90
? find_held_lock+0x72/0x90
__might_fault+0x68/0x90
? __might_fault+0x3e/0x90
_copy_to_user+0x1e/0x80
copy_to_sk.isra.32+0x121/0x300
? btrfs_search_forward+0x2a6/0x360
search_ioctl+0x106/0x200
btrfs_ioctl_tree_search_v2+0x7b/0xf0
btrfs_ioctl+0x106f/0x30a0
? __do_sys_newfstat+0x5a/0x70
? ksys_ioctl+0x83/0xc0
ksys_ioctl+0x83/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x50/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The problem is we're doing a copy_to_user() while holding tree locks,
which can deadlock if we have to do a page fault for the copy_to_user().
This exists even without my locking changes, so it needs to be fixed.
Rework the search ioctl to do the pre-fault and then
copy_to_user_nofault for the copying.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we always set the full
sync flag on the inode, which forces the next fsync to use a slower code
path.
This hurts performance for workloads that dirty an amount of data that
exceeds or is very close to the system's RAM memory and do frequent fsync
operations (like database servers can for example). In particular if there
are concurrent fsyncs against different files, by falling back to a full
fsync we do a lot more checksum lookups in the checksums btree, as we do
it for all the extents created in the current transaction, instead of only
the new ones since the last fsync. These checksums lookups not only take
some time but, more importantly, they also cause contention on the
checksums btree locks due to the concurrency with checksum insertions in
the btree by ordered extents from other inodes.
We actually don't need to set the full sync flag on the inode, because we
only remove extent maps that are in the list of modified extents if they
were created in a past transaction, in which case an fsync skips them as
it's pointless to log them. So stop setting the full fsync flag on the
inode whenever we remove an extent map.
This patch is part of a patchset that consists of 3 patches, which have
the following subjects:
1/3 btrfs: fix race between page release and a fast fsync
2/3 btrfs: release old extent maps during page release
3/3 btrfs: do not set the full sync flag on the inode during page release
Performance tests were ran against a branch (misc-next) containing the
whole patchset. The test exercises a workload where there are multiple
processes writing to files and fsyncing them (each writing and fsyncing
its own file), and in total the amount of data dirtied ranges from 2x to
4x the system's RAM memory (16GiB), so that the page release callback is
invoked frequently.
The following script, using fio, was used to perform the tests:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 3 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=write
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=64k
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.
The obtained results were the following, and the last line printed by
fio is pasted (includes aggregated throughput and test run time).
*****************************************************
**** 1 job, 32GiB file, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
After patchset:
WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
(+0.7% throughput, -0.8% run time)
*****************************************************
**** 2 jobs, 16GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
After patchset:
WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
(+19.1% throughput, -16.1% runtime)
*****************************************************
**** 4 jobs, 8GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
After patchset:
WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
(+37.8% throughput, -27.5% runtime)
*****************************************************
**** 8 jobs, 4GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
After patchset:
WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
(+74.7% throughput, -43.0% run time)
*****************************************************
**** 16 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
After patchset:
WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
(+146.3% throughput, -59.5% run time)
*****************************************************
**** 32 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
After patchset:
WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
(+212.7% throughput, -68.0% run time)
*****************************************************
**** 64 jobs, 1GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
After patchset:
WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
(+197.6% throughput, -66.4% run time)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we never release an extent
map that is in the list of modified extents. This is to prevent races with
a concurrent fsync using the fast path, which could lead to not logging an
extent created in the current transaction.
However we can safely remove an extent map created in a past transaction
that is still in the list of modified extents (because no one fsynced yet
the inode after that transaction got commited), because such extents are
skipped during an fsync as it is pointless to log them. This change does
that.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When releasing an extent map, done through the page release callback, we
can race with an ongoing fast fsync and cause the fsync to miss a new
extent and not log it. The steps for this to happen are the following:
1) A page is dirtied for some inode I;
2) Writeback for that page is triggered by a path other than fsync, for
example by the system due to memory pressure;
3) When the ordered extent for the extent (a single 4K page) finishes,
we unpin the corresponding extent map and set its generation to N,
the current transaction's generation;
4) The btrfs_releasepage() callback is invoked by the system due to
memory pressure for that no longer dirty page of inode I;
5) At the same time, some task calls fsync on inode I, joins transaction
N, and at btrfs_log_inode() it sees that the inode does not have the
full sync flag set, so we proceed with a fast fsync. But before we get
into btrfs_log_changed_extents() and lock the inode's extent map tree:
6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
and we remove the extent map for the new 4Kb extent, because it is
neither pinned anymore nor locked. By calling remove_extent_mapping(),
we remove the extent map from the list of modified extents, since the
extent map does not have the logging flag set. We unlock the inode's
extent map tree;
7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
locks the inode's extent map tree and iterates its list of modified
extents, which no longer has the 4Kb extent in it, so it does not log
the extent;
8) The fsync finishes;
9) Before transaction N is committed, a power failure happens. After
replaying the log, the 4K extent of inode I will be missing, since
it was not logged due to the race with try_release_extent_mapping().
So fix this by teaching try_release_extent_mapping() to not remove an
extent map if it's still in the list of modified extents.
Fixes: ff44c6e36d ("Btrfs: do not hold the write_lock on the extent tree while logging")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Eric reported seeing this message while running generic/475
BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
Full stack trace:
BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
BTRFS info (device dm-0): forced readonly
BTRFS warning (device dm-0): Skipping commit of aborted transaction.
------------[ cut here ]------------
BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
BTRFS: Transaction aborted (error -117)
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
Call Trace:
? finish_wait+0x90/0x90
? __mutex_unlock_slowpath+0x45/0x2a0
? lock_acquire+0xa3/0x440
? lockref_put_or_lock+0x9/0x30
? dput+0x20/0x4a0
? dput+0x20/0x4a0
? do_raw_spin_unlock+0x4b/0xc0
? _raw_spin_unlock+0x1f/0x30
btrfs_sync_file+0x335/0x490 [btrfs]
do_fsync+0x38/0x70
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x50/0xe0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6a6ef1b6e3
Code: Bad RIP value.
RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace af146e0e38433456 ]---
BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
This ret came from btrfs_write_marked_extents(). If we get an aborted
transaction via EIO before, we'll see it in btree_write_cache_pages()
and return EUCLEAN, which gets printed as "Filesystem corrupted".
Except we shouldn't be returning EUCLEAN here, we need to be returning
EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
elsewhere, but we want to use EROFS for this particular case. The
original transaction abort has the real error code for why we ended up
with an aborted transaction, all subsequent actions just need to return
EROFS because they may not have a trans handle and have no idea about
the original cause of the abort.
After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
stacktrace will not be dumped either.
Reported-by: Eric Sandeen <esandeen@redhat.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add full test stacktrace ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since there is not common cleanup run after the label it makes it
somewhat redundant.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make the function directly return a pointer to a failure record and
adjust callers to handle it. Also refactor the logic inside so that
the case which allocates the failure record for the first time is not
handled in an 'if' arm, saving us a level of indentation. Finally make
the function static as it's not used outside of extent_io.c .
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only failure that get_state_failrec can get is if there is no failure
for the given address. There is no reason why the function should return
a status code and use a separate parameter for returning the actual
failure rec (if one is found). Simplify it by making the return type
a pointer and return ERR_PTR value in case of errors.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only find_lock_delalloc_range uses vfs_inode so let's take the
btrfs_inode as a parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has only a single use for a generic vfs inode vs 3 for btrfs_inode.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All children now take btrfs_inode so convert it to taking it as a
parameter as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fiemap callback is not part of UAPI interface and the prototypes
don't have the __u64 types either.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It has one VFS and 1 btrfs inode usages but converting it to btrfs_inode
interface will allow seamless conversion of its callers.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8auzgACgkQxWXV+ddt
WDv0CRAAooFO+hloV+br40eEfJwZJJk+iIvc3tyq3TRUrmt1D0G4F7nUtiHjb8JU
ch2HK+GNZkIK4747OCgcFREpYZV2m0hrKybzf/j4mYb7OXzHmeHTMfGVut1g80e7
dlpvP7q4VZbBP8BTo/8wqdSAdCUiNhLFy5oYzyUwyflJ5S8FpjY+3dXIRHUnhxPU
lxMANWhX9y/qQEceGvxqwqJBiYT6WI7dwONiULc1klWDIug/2BGZQR0WuC5PVr0G
YNuxcEU6rluWzKWJ5k3104t+N1Nc5+xglIgBLeLKAyTVYq8zAMf+P8bBPnQ3QDkV
zniNIH9ND8tYSjmGkmO0ltExFrE2o9NRnjapOFXfB0WGXee5LfzFfzd5Hk9YV+Ua
bs98VNGR4B12Iw++DvrbhbFAMxBHiBfAX/O44xJ81uAYVUs21OfefxHWrLzTJK+1
xYfiyfCDxZDGpC/weg9GOPcIZAzzoSAvqDqWHyWY5cCZdB60RaelGJprdG5fP/gA
Y+hDIdutVXMHfhaX0ktWsDvhPRXcC7MT0bjasljkN5WUJ/xZZQr6QmgngY+FA8G/
0n/dv0pYdOTK/8YVZAMO+VklzrDhziqzc2sBrH1k3MA9asa/Ls5v+r2PU+qBKZJm
cBJGtxxsx72CHbkIhtd5oGj5LNTXFdXeHph37ErzW3ajeamO4X0=
=51h/
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into master
Pull btrfs fixes from David Sterba:
"A few resouce leak fixes from recent patches, all are stable material.
The problems have been observed during testing or have a reproducer"
* tag 'for-5.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix mount failure caused by race with umount
btrfs: fix page leaks after failure to lock page for delalloc
btrfs: qgroup: fix data leak caused by race between writeback and truncate
btrfs: fix double free on ulist after backref resolution failure
When locking pages for delalloc, we check if it's dirty and mapping still
matches. If it does not match, we need to return -EAGAIN and release all
pages. Only the current page was put though, iterate over all the
remaining pages too.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl8EdTkACgkQxWXV+ddt
WDv6xA/9Hguo/k6oj/7Nl9n3UUZ7gp44R/jy37fhMuNcwuEDuqIEfAgGXupdJVaj
pYDorUMRUQfI2yLB1iHAnPgBMKBidSroDsdrRHKuimnhABSO2/KX/KXPianIIRGi
wPvqZR04L565LNpRlDQx7OYkJWey7b6xf47UZqDglivnKY1OwCJlXgfCj/9FApr0
Y+PVlgEU78ExTeAHs/h8ofZ/f5T2eqiluBSFVykzCg1NngaQVOKpN3gnWEatUAvM
ekm6U4E1ZR9oOprdhlf6V96ztGzVTRKB1vFIeCvJLqLNIe+0pxlRfRn2aOj8vzEO
DRjgOlhyAIgypp78SwCspjhvejvVneSFdEGSVvHOw1ombB//OJ1qBb5G/lIcwCj3
PZ3OnQJV7+/Ty7Xt/X26W841zvnu90K0di0CsOPehtbkgkR4txgHCJB9mSlsMugN
awN5Ryy1rw1cAM5GspXG9EEOvJmnSizQf4BcK649IG5eUKThYYLc5mp68jiMljs0
NHFPg5P4yTRjk7Yqgxq5VvTPLLJo5j5xxqtY/1zDWuguRa40wIoy/JUJaJoPg9Vd
221/qRG4R4xGyZXGx6XTiWK+M3qjTlS9My9tGoWygwlExRkr7Uli9Ikef3U0tBoF
bjTcfCNOuCp+JECHNcnMZ9fhhFaMwIL1V4OflB1iicBAtXxo8Lk=
=+4BZ
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fix of a leak in global block reserve accounting
- fix a (hard to hit) race of readahead vs releasepage that could lead
to crash
- convert all remaining uses of comment fall through annotations to the
pseudo keyword
- fix crash when mounting a fuzzed image with -o recovery
* tag 'for-5.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: reset tree root pointer after error in init_tree_roots
btrfs: fix reclaim_size counter leak after stealing from global reserve
btrfs: fix fatal extent_buffer readahead vs releasepage race
btrfs: convert comments to fallthrough annotations
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl7U50AACgkQxWXV+ddt
WDtK1g//RXeNsTguYQr1N9R5eUPThjLEI0+4J0l4SYfCPU8Ou3C7nqpOEJJQgm8F
ezZE+16cWi9U5uGueOc+w0rfyz4AuIXKgzoz+c0/GG2+yV5jp6DsAMbWqojAb96L
V/N3HxEzR66jqwgVUBE/x5okb2SyY7//B1l/O0amc66XDO7KTMImpIwThere6zWZ
o2SNpYpHAPQeUYJQx8h+FAW3w1CxrCZmnifazU9Jqe9J7QeQLg7rbUlJDV38jySm
ZOA8ohKN9U1gPZy+dTU3kdyyuBIq1etkIaSPJANyTo5TczPKiC0IMg75cXtS4ae/
NSxhccMpSIjVMcIHARzSFGYKNP3sGNRsmaTUg/2Cx/9GoHOhYMiCAVc8qtBBpwJO
UI0siexrCe64RuTBMRRc128GdFv7IjmSImcdi8xaR62bCcUiNdEa3zvjRe/9tOEH
ET7Z85oBnKpSzpC3MdhSUU4dtHY5XLawP8z3oUU1VSzSWM2DVjlHf79/VzbOfp18
miCVpt94lCn/gUX7el6qcnbuvMAjDyeC6HmfD+TwzQgGwyV6TLgKN9lRXeH/Oy6/
VgjGQSavGHMll3zIGURmrBCXKudjJg0J+IP4wN1TimmSEMfwKH+7tnekQd8y5qlF
eXEIqlWNykKeDzEnmV9QJy+/cV83hVWM/mUslcTx39tLN/3B/Us=
=qTt8
-----END PGP SIGNATURE-----
Merge tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Highlights:
- speedup dead root detection during orphan cleanup, eg. when there
are many deleted subvolumes waiting to be cleaned, the trees are
now looked up in radix tree instead of a O(N^2) search
- snapshot creation with inherited qgroup will mark the qgroup
inconsistent, requires a rescan
- send will emit file capabilities after chown, this produces a
stream that does not need postprocessing to set the capabilities
again
- direct io ported to iomap infrastructure, cleaned up and simplified
code, notably removing last use of struct buffer_head in btrfs code
Core changes:
- factor out backreference iteration, to be used by ordinary
backreferences and relocation code
- improved global block reserve utilization
* better logic to serialize requests
* increased maximum available for unlink
* improved handling on large pages (64K)
- direct io cleanups and fixes
* simplify layering, where cloned bios were unnecessarily created
for some cases
* error handling fixes (submit, endio)
* remove repair worker thread, used to avoid deadlocks during
repair
- refactored block group reading code, preparatory work for new type
of block group storage that should improve mount time on large
filesystems
Cleanups:
- cleaned up (and slightly sped up) set/get helpers for metadata data
structure members
- root bit REF_COWS got renamed to SHAREABLE to reflect the that the
blocks of the tree get shared either among subvolumes or with the
relocation trees
Fixes:
- when subvolume deletion fails due to ENOSPC, the filesystem is not
turned read-only
- device scan deals with devices from other filesystems that changed
ownership due to overwrite (mkfs)
- fix a race between scrub and block group removal/allocation
- fix long standing bug of a runaway balance operation, printing the
same line to the syslog, caused by a stale status bit on a reloc
tree that prevented progress
- fix corrupt log due to concurrent fsync of inodes with shared
extents
- fix space underflow for NODATACOW and buffered writes when it for
some reason needs to fallback to COW mode"
* tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
btrfs: fix space_info bytes_may_use underflow during space cache writeout
btrfs: fix space_info bytes_may_use underflow after nocow buffered write
btrfs: fix wrong file range cleanup after an error filling dealloc range
btrfs: remove redundant local variable in read_block_for_search
btrfs: open code key_search
btrfs: split btrfs_direct_IO to read and write part
btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
fs: remove dio_end_io()
btrfs: switch to iomap_dio_rw() for dio
iomap: remove lockdep_assert_held()
iomap: add a filesystem hook for direct I/O bio submission
fs: export generic_file_buffered_read()
btrfs: turn space cache writeout failure messages into debug messages
btrfs: include error on messages about failure to write space/inode caches
btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
btrfs: make checksum item extension more efficient
btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
btrfs: unexport btrfs_compress_set_level()
btrfs: simplify iget helpers
...
Since the new pair function is introduced, we can call them to clean the
code in btrfs.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Link: http://lkml.kernel.org/r/20200517214718.468-4-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement the new readahead method in btrfs using the new
readahead_page_batch() function.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Link: http://lkml.kernel.org/r/20200414150233.24495-18-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Helpers that iterate over extent buffer pages set up several variables,
one of them is finding out offset of the extent buffer start within a
page. Right now we have extent buffers aligned to page sizes so this is
effectively storing zero. This makes the code harder the follow and can
be simplified.
The same change is done in all the helpers:
* remove: size_t start_offset = offset_in_page(eb->start);
* simplify code using start_offset
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many helpers around extent buffers, found in extent_io.h and
ctree.h. Most of them can be converted to take constified eb as there
are no changes to the extent buffer structure itself but rather the
pages.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All uses of map_private_extent_buffer have been replaced by more
effective way. The set/get helpers have their own bounds checker.
The function name was confusing since the non-private helper was removed
in a65917156e ("Btrfs: stop using highmem for extent_buffers") many
years ago.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, direct I/O has its own versions of bio_readpage_error() and
btrfs_check_repairable() (dio_read_error() and
btrfs_check_dio_repairable(), respectively). The main difference is that
the direct I/O version doesn't do read validation. The rework of direct
I/O repair makes it possible to do validation, so we can get rid of
btrfs_check_dio_repairable() and combine bio_readpage_error() and
dio_read_error() into a new helper, btrfs_submit_read_repair().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Direct I/O read repair was originally implemented in commit 8b110e393c
("Btrfs: implement repair function when direct read fails"). This
implementation is unnecessarily complicated. There is major code
duplication between __btrfs_subio_endio_read() (checks checksums and
handles I/O errors for files with checksums),
__btrfs_correct_data_nocsum() (handles I/O errors for files without
checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
for retries of files with checksums), and btrfs_retry_endio_nocsum()
(handles I/O errors for retries of files without checksum). If it sounds
like these should be one function, that's because they should.
Additionally, these functions are very hard to follow due to their
excessive use of goto.
This commit replaces the original implementation. After the previous
commit getting rid of orig_bio, we can reuse the same endio callback for
repair I/O and the original I/O, we just need to track the file offset
and original iterator in the repair bio. We can also unify the handling
of files with and without checksums and simplify the control flow. We
also no longer have to wait for each repair I/O to complete one by one.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since its introduction in commit 2fe6303e7c ("Btrfs: split
bio_readpage_error into several functions"), btrfs_check_repairable()
has only been used from extent_io.c where it is defined.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The purpose of the validation step is to distinguish between good and
bad sectors in a failed multi-sector read. If a multi-sector read
succeeded but some of those sectors had checksum errors, we don't need
to validate anything; we know the sectors with bad checksums need to be
repaired.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Read repair does two things: it finds a good copy of data to return to
the reader, and it corrects the bad copy on disk. If a read of multiple
sectors has an I/O error, repair does an extra "validation" step that
issues a separate read for each sector. This allows us to find the exact
failing sectors and only rewrite those.
This heuristic is implemented in
bio_readpage_error()/btrfs_check_repairable() as:
failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
if (failed_bio_pages > 1)
do validation
However, at this point, bi_iter may have already been advanced. This
means that we'll skip the validation step and rewrite the entire failed
read.
Fix it by getting the actual size from the biovec (which we can do
because this is only called for non-cloned bios, although that will
change in a later commit).
Fixes: 8a2ee44a37 ("btrfs: look at bi_size for repair decisions")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a few different ways to free roots, either you allocated them
yourself and you just do
free_extent_buffer(root->node);
free_extent_buffer(root->commit_node);
btrfs_put_root(root);
Which is the pattern for log roots. Or for snapshots/subvolumes that
are being dropped you simply call btrfs_free_fs_root() which does all
the cleanup for you.
Unify this all into btrfs_put_root(), so that we don't free up things
associated with the root until the last reference is dropped. This
makes the root freeing code much more significant.
The only caveat is at close_ctree() time we have to free the extent
buffers for all of our main roots (extent_root, chunk_root, etc) because
we have to drop the btree_inode and we'll run into issues if we hold
onto those nodes until ->kill_sb() time. This will be addressed in the
future when we kill the btree_inode.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I'm going to make the entire destruction of btrfs_root's controlled by
their refcount, so it will be helpful to notice if we're leaking their
eb's on umount.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a fuzzed image which could cause KASAN report at unmount time.
BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
Read of size 8 at addr ffff888067cf6848 by task umount/1922
CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x5b/0x8b
print_address_description+0x70/0x280
kasan_report+0x13a/0x19b
btrfs_queue_work+0x2c1/0x390
btrfs_wq_submit_bio+0x1cd/0x240
btree_submit_bio_hook+0x18c/0x2a0
submit_one_bio+0x1be/0x320
flush_write_bio.isra.41+0x2c/0x70
btree_write_cache_pages+0x3bb/0x7f0
do_writepages+0x5c/0x130
__writeback_single_inode+0xa3/0x9a0
writeback_single_inode+0x23d/0x390
write_inode_now+0x1b5/0x280
iput+0x2ef/0x600
close_ctree+0x341/0x750
generic_shutdown_super+0x126/0x370
kill_anon_super+0x31/0x50
btrfs_kill_super+0x36/0x2b0
deactivate_locked_super+0x80/0xc0
deactivate_super+0x13c/0x150
cleanup_mnt+0x9a/0x130
task_work_run+0x11a/0x1b0
exit_to_usermode_loop+0x107/0x130
do_syscall_64+0x1e5/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[CAUSE]
The fuzzed image has a completely screwd up extent tree:
leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 0 count 1
item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 271 offset 0 count 1
item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
item 3 key (29360128 169 0) itemoff 3803 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 4 key (29368320 169 1) itemoff 3770 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 5 key (29372416 169 0) itemoff 3737 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
Note that leaf 29421568 doesn't have its backref in the extent tree.
Thus extent allocator can re-allocate leaf 29421568 for other trees.
In short, the bug is caused by:
- Existing tree block gets allocated to log tree
This got its generation bumped.
- Log tree balance cleaned dirty bit of offending tree block
It will not be written back to disk, thus no WRITTEN flag.
- Original owner of the tree block gets COWed
Since the tree block has higher transid, no WRITTEN flag, it's reused,
and not traced by transaction::dirty_pages.
- Transaction aborted
Tree blocks get cleaned according to transaction::dirty_pages. But the
offending tree block is not recorded at all.
- Filesystem unmount
All pages are assumed to be are clean, destroying all workqueue, then
call iput(btree_inode).
But offending tree block is still dirty, which triggers writeback, and
causes use-after-free bug.
The detailed sequence looks like this:
- Initial status
eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
not traced by any dirty extent_iot_tree.
- New tree block is allocated
Since there is no backref for 29421568, it's re-allocated as new tree
block.
Keep in mind that tree block 29421568 is still referred by extent
tree.
- Tree block 29421568 is filled for log tree
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
traced by btrfs_root::dirty_log_pages
- Some log tree operations
Since the fs is using node size 4096, the log tree can easily go a
level higher.
- Log tree needs balance
Tree block 29421568 gets all its content pushed to right, thus now
it is empty, and we don't need it.
btrfs_clean_tree_block() from __push_leaf_right() get called.
eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
traced by btrfs_root::dirty_log_pages
- Log tree write back
btree_write_cache_pages() goes through dirty pages ranges, but since
page of tree block 29421568 gets cleaned already, it's not written
back to disk. Thus it doesn't have WRITTEN bit set.
But ranges in dirty_log_pages are cleared.
eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
not traced by any dirty extent_iot_tree.
- Extent tree update when committing transaction
Since tree block 29421568 has transid equal to running trans, and has
no WRITTEN bit, should_cow_block() will use it directly without adding
it to btrfs_transaction::dirty_pages.
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_iot_tree.
At this stage, we're doomed. We have a dirty eb not tracked by any
extent io tree.
- Transaction gets aborted due to corrupted extent tree
Btrfs cleans up dirty pages according to transaction::dirty_pages and
btrfs_root::dirty_log_pages.
But since tree block 29421568 is not tracked by neither of them, it's
still dirty.
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_iot_tree.
- Filesystem unmount
Since all cleanup is assumed to be done, all workqueus are destroyed.
Then iput(btree_inode) is called, expecting no dirty pages.
But tree 29421568 is still dirty, thus triggering writeback.
Since all workqueues are already freed, we cause use-after-free.
This shows us that, log tree blocks + bad extent tree can cause wild
dirty pages.
[FIX]
To fix the problem, don't submit any btree write bio if the filesytem
has any error. This is the last safe net, just in case other cleanup
haven't caught catch it.
Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sparse reports a warning at release_extent_buffer()
warning: context imbalance in release_extent_buffer() - unexpected unlock
The root cause is the missing annotation at release_extent_buffer()
Add the missing __releases(&eb->refs_lock) annotation
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When attempting to set bits on a range of an exent io tree that already
has those bits set we can end up splitting an extent state record, use
the preallocated extent state record, insert it into the red black tree,
do another search on the red black tree, merge the preallocated extent
state record with the previous extent state record, remove that previous
record from the red black tree and then free it. This is all unnecessary
work that consumes time.
This happens specifically at the following case at __set_extent_bit():
$ cat -n fs/btrfs/extent_io.c
957 static int __must_check
958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
(...)
1044 /*
1045 * | ---- desired range ---- |
1046 * | state |
1047 * or
1048 * | ------------- state -------------- |
1049 *
(...)
1060 if (state->start < start) {
1061 if (state->state & exclusive_bits) {
1062 *failed_start = start;
1063 err = -EEXIST;
1064 goto out;
1065 }
1066
1067 prealloc = alloc_extent_state_atomic(prealloc);
1068 BUG_ON(!prealloc);
1069 err = split_state(tree, state, prealloc, start);
1070 if (err)
1071 extent_io_tree_panic(tree, err);
1072
1073 prealloc = NULL;
So if our extent state represents a range from 0 to 1MiB for example, and
we want to set bits in the range 128KiB to 256KiB for example, and that
extent state record already has all those bits set, we end up splitting
that record, so we end up with extent state records in the tree which
represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is
temporary because a subsequent iteration in that function will end up
merging the records.
The splitting requires using the preallocated extent state record, so
a future iteration that needs to do another split will need to allocate
another extent state record in an atomic context, something not ideal
that we try to avoid as much as possible. The splitting also requires
an insertion in the red black tree, and a subsequent merge will require
a deletion from the red black tree and freeing an extent state record.
This change just skips the splitting of an extent state record when it
already has all the bits the we need to set.
Setting a bit that is already set for a range is very common in the
inode's 'file_extent_tree' extent io tree for example, where we keep
setting the EXTENT_DIRTY bit every time we replace an extent.
This change also fixes a bug that happens after the recent patchset from
Josef that avoids having implicit holes after a power failure when not
using the NO_HOLES feature, more specifically the patch with the subject:
"btrfs: introduce the inode->file_extent_tree"
This patch introduced an extent io tree per inode to keep track of
completed ordered extents and figure out at any time what is the safe
value for the inode's disk_i_size. This assumes that for contiguous
ranges in a file we always end up with a single extent state record in
the io tree, but that is not the case, as there is a short time window
where we can have two extent state records representing contiguous
ranges. When this happens we end setting up an incorrect value for the
inode's disk_i_size, resulting in data loss after a clean unmount
of the filesystem. The following example explains how this can happen.
Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in
the inode's file_extent_tree we have a single extent state record that
represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the
following steps happen:
1) A buffered write against file range [512KiB, 768KiB) is made. At this
point delalloc was not flushed yet;
2) Deduplication from some other inode into this inode's range
[128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range()
to be called, from btrfs_insert_clone_extent(), to mark the range
[128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree;
3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
end up at __set_extent_bit(). In the first iteration of that function's
loop we end up in the following branch:
$ cat -n fs/btrfs/extent_io.c
957 static int __must_check
958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
(...)
1044 /*
1045 * | ---- desired range ---- |
1046 * | state |
1047 * or
1048 * | ------------- state -------------- |
1049 *
(...)
1060 if (state->start < start) {
1061 if (state->state & exclusive_bits) {
1062 *failed_start = start;
1063 err = -EEXIST;
1064 goto out;
1065 }
1066
1067 prealloc = alloc_extent_state_atomic(prealloc);
1068 BUG_ON(!prealloc);
1069 err = split_state(tree, state, prealloc, start);
1070 if (err)
1071 extent_io_tree_panic(tree, err);
1072
1073 prealloc = NULL;
(...)
1089 goto search_again;
This splits the state record into two, one for range [0, 128KiB) and
another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY
bit set. Then we jump to the 'search_again' label, where we unlock the
the spinlock protecting the extent io tree before jumping to the
'again' label to perform the next iteration;
4) In the meanwhile, delalloc is flushed, the ordered extent for the range
[512KiB, 768KiB) is created and when it completes, at
btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
with a value of 0 for its 'new_size' argument;
5) Before the deduplication task currently at __set_extent_bit() moves to
the next iteration, the task finishing the ordered extent calls
find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
and gets 'start' set to 0 and 'end' set to 128KiB - because at this
moment the io tree has two extent state records, one representing the
range [0, 128KiB) and another representing the range [128KiB, 1MiB),
both with EXTENT_DIRTY set. Then we set 'isize' to:
isize = min(isize, end + 1)
= min(1MiB, 128KiB - 1 + 1)
= 128KiB
Then we set the inode's disk_i_size to 128KiB (isize).
After a clean unmount of the filesystem and mounting it again, we have
the file with a size of 128KiB, and effectively lost all the data it
had before in the range from 128KiB to 1MiB.
This change fixes that issue too, as we never end up splitting extent
state records when they already have all the bits we want set.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode, use it and drop the
redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the page's inode, use it and
drop the redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode so we can drop the
redundant argument from btrfs_lock_and_flush_ordered_range.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add assertions to all helpers that get tree as argument and verify that
it's the same that can be obtained from the inode or from its pages. In
followup patches the redundant arguments and assertions will be removed
one by one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're sure the tree from argument is same as the one we can get
from the page's inode io_tree, drop the redundant argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All functions that set up extent_page_data::tree set it to the inode
io_tree. That's passed down the callstack that accesses either the same
inode or its pages. In the end submit_extent_page can pull the tree out
of the page and we don't have to store it in the structure.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to keep track of where we have file extents on disk, and thus
where it is safe to adjust the i_size to, we need to have a tree in
place to keep track of the contiguous areas we have file extents for.
Add helpers to use this tree, as it's not required for NO_HOLES file
systems. We will use this by setting DIRTY for areas we know we have
file extent item's set, and clearing it when we remove file extent items
for truncation.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we only return 0 or -EAGAIN from btrfs_writepage_cow_fixup, we
do not need this -EBUSY case.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Raviu reported that running his regular fs_trim segfaulted with the
following backtrace:
[ 237.525947] assertion failed: prev, in ../fs/btrfs/extent_io.c:1595
[ 237.525984] ------------[ cut here ]------------
[ 237.525985] kernel BUG at ../fs/btrfs/ctree.h:3117!
[ 237.525992] invalid opcode: 0000 [#1] SMP PTI
[ 237.525998] CPU: 4 PID: 4423 Comm: fstrim Tainted: G U OE 5.4.14-8-vanilla #1
[ 237.526001] Hardware name: ASUSTeK COMPUTER INC.
[ 237.526044] RIP: 0010:assfail.constprop.58+0x18/0x1a [btrfs]
[ 237.526079] Call Trace:
[ 237.526120] find_first_clear_extent_bit+0x13d/0x150 [btrfs]
[ 237.526148] btrfs_trim_fs+0x211/0x3f0 [btrfs]
[ 237.526184] btrfs_ioctl_fitrim+0x103/0x170 [btrfs]
[ 237.526219] btrfs_ioctl+0x129a/0x2ed0 [btrfs]
[ 237.526227] ? filemap_map_pages+0x190/0x3d0
[ 237.526232] ? do_filp_open+0xaf/0x110
[ 237.526238] ? _copy_to_user+0x22/0x30
[ 237.526242] ? cp_new_stat+0x150/0x180
[ 237.526247] ? do_vfs_ioctl+0xa4/0x640
[ 237.526278] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[ 237.526283] do_vfs_ioctl+0xa4/0x640
[ 237.526288] ? __do_sys_newfstat+0x3c/0x60
[ 237.526292] ksys_ioctl+0x70/0x80
[ 237.526297] __x64_sys_ioctl+0x16/0x20
[ 237.526303] do_syscall_64+0x5a/0x1c0
[ 237.526310] entry_SYSCALL_64_after_hwframe+0x49/0xbe
That was due to btrfs_fs_device::aloc_tree being empty. Initially I
thought this wasn't possible and as a percaution have put the assert in
find_first_clear_extent_bit. Turns out this is indeed possible and could
happen when a file system with SINGLE data/metadata profile has a 2nd
device added. Until balance is run or a new chunk is allocated on this
device it will be completely empty.
In this case find_first_clear_extent_bit should return the full range
[0, -1ULL] and let the caller handle this i.e for trim the end will be
capped at the size of actual device.
Link: https://lore.kernel.org/linux-btrfs/izW2WNyvy1dEDweBICizKnd2KDwDiDyY2EYQr4YCwk7pkuIpthx-JRn65MPBde00ND6V0_Lh8mW0kZwzDiLDv25pUYWxkskWNJnVP0kgdMA=@protonmail.com/
Fixes: 45bfcfc168 ("btrfs: Implement find_first_clear_extent_bit")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There exists a deadlock with range_cyclic that has existed forever. If
we loop around with a bio already built we could deadlock with a writer
who has the page locked that we're attempting to write but is waiting on
a page in our bio to be written out. The task traces are as follows
PID: 1329874 TASK: ffff889ebcdf3800 CPU: 33 COMMAND: "kworker/u113:5"
#0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
#1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
#2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
#3 [ffffc900297bb708] __lock_page at ffffffff811f145b
#4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
#5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
#6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
#7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
#8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
#9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
PID: 2167901 TASK: ffff889dc6a59c00 CPU: 14 COMMAND:
"aio-dio-invalid"
#0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
#1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
#2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
#3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
#4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
#5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
#6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
#7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
#8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
#9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
I used drgn to find the respective pages we were stuck on
page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
As you can see the kworker is waiting for bit 0 (PG_locked) on index
7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
8148. aio-dio-invalid has 7680, and the kworker epd looks like the
following
crash> struct extent_page_data ffffc900297bbbb0
struct extent_page_data {
bio = 0xffff889f747ed830,
tree = 0xffff889eed6ba448,
extent_locked = 0,
sync_io = 0
}
Probably worth mentioning as well that it waits for writeback of the
page to complete while holding a lock on it (at prepare_pages()).
Using drgn I walked the bio pages looking for page
0xffffea00fbfc7500 which is the one we're waiting for writeback on
bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
for i in range(0, bio.bi_vcnt.value_()):
bv = bio.bi_io_vec[i]
if bv.bv_page.value_() == 0xffffea00fbfc7500:
print("FOUND IT")
which validated what I suspected.
The fix for this is simple, flush the epd before we loop back around to
the beginning of the file during writeout.
Fixes: b293f02e14 ("Btrfs: Add writepages support")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We noticed that we were having regular CG OOM kills in cases where there
was still enough dirty pages to avoid OOM'ing. It turned out there's
this corner case in btrfs's handling of range_cyclic where files that
were being redirtied were not getting fully written out because of how
we do range_cyclic writeback.
We unconditionally were setting scanned = 1; the first time we found any
pages in the inode. This isn't actually what we want, we want it to be
set if we've scanned the entire file. For range_cyclic we could be
starting in the middle or towards the end of the file, so we could write
one page and then not write any of the other dirty pages in the file
because we set scanned = 1.
Fix this by not setting scanned = 1 if we find pages. The rules for
setting scanned should be
1) !range_cyclic. In this case we have a specified range to write out.
2) range_cyclic && index == 0. In this case we've started at the
beginning and there is no need to loop around a second time.
3) range_cyclic && we started at index > 0 and we've reached the end of
the file without satisfying our nr_to_write.
This patch fixes both of our writepages implementations to make sure
these rules hold true. This fixed our over zealous CG OOMs in
production.
Fixes: d1310b2e0c ("Btrfs: Split the extent_map code into two parts")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 7087a9d8db ("btrfs: Remove
extent_io_ops::writepage_end_io_hook") left this logic in a confusing
state. Simplify it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass this as 1 from __extent_writepage_io(). The parameter
basically means "pretend I didn't pass in a page". This is silly since
we can simply not pass in the page. Get rid of the parameter from
btrfs_get_extent(), and since it's used as a get_extent_t callback,
remove it from get_extent_t and btree_get_extent(), neither of which
need it.
While we're here, let's document btrfs_get_extent().
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In __extent_writepage_io(), we check whether
i_size <= page_offset(page).
Note that if i_size < page_offset(page), then
i_size >> PAGE_SHIFT < page->index.
If i_size == page_offset(page), then
i_size >> PAGE_SHIFT == page->index && offset_in_page(i_size) == 0.
__extent_writepage() already has a check for these cases that
returns without calling __extent_writepage_io():
end_index = i_size >> PAGE_SHIFT
pg_offset = offset_in_page(i_size);
if (page->index > end_index ||
(page->index == end_index && !pg_offset)) {
page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
unlock_page(page);
return 0;
}
Get rid of the one in __extent_writepage_io(), which was obsoleted in
211c17f51f ("Fix corners in writepage and btrfs_truncate_page").
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since 40f765805f ("Btrfs: split up __extent_writepage to lower stack
usage"), done_unlocked is simply a return 0. Get rid of it.
Mid-statement block returns don seem to make the code less readable here.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're initializing pg_offset to 0, setting it immediately, then
reassigning it to 0 again after. The former became unnecessary in
211c17f51f ("Fix corners in writepage and btrfs_truncate_page"). The
latter is a leftover that should've been removed in 40f765805f
("Btrfs: split up __extent_writepage to lower stack usage"). Remove
both.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Callers of alloc_test_extent_buffer have not correctly interpreted the
return value as error pointer, as alloc_test_extent_buffer should behave
as alloc_extent_buffer. The self-tests were unaffected but
btrfs_find_create_tree_block could call both functions and that would
cause problems up in the call chain.
Fixes: faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting code")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After previous patches removing bdev being passed around to set it to
bio, it has become unused in submit_extent_page. So it now has "only" 13
parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev or pass it around inside the extent map.
Signed-off-by: David Sterba <dsterba@suse.com>
bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
also changing some state bits if there was a different bdev set before.
This is one thing that's not needed.
Another thing is that setting a bdev at bio allocation time is too early
and actually does not work with plain redundancy profiles, where each
time we submit a bio to a device, the bdev is set correctly.
In many places the bio bdev is set to latest_bdev that seems to serve as
a stub pointer "just to put something to bio". But we don't have to do
that.
Where do we know which bdev to set:
* for regular IO: submit_stripe_bio that's called by btrfs_map_bio
* repair IO: repair_io_failure, read or write from specific device
* super block write (using buffer_heads but uses raw bdev) and barriers
* scrub: this does not use all regular IO paths as it needs to reach all
copies, verify and fixup eventually, and for that all bdev management
is independent
* raid56: rbio_add_io_page, for the RMW write
* integrity-checker: does it's own low-level block tracking
Signed-off-by: David Sterba <dsterba@suse.com>
This is preparatory patch to remove @bdev parameter from
submit_extent_page. It can't be removed completely, because the cgroups
need it for wbc when initializing the bio
wbc_init_bio
bio_associate_blkg_from_css
dereference bdev->bi_disk->queue
The bdev pointer is the same as latest_bdev, thus no functional change.
We can retrieve it from fs_devices that's reachable through several
dereferences. The local variable shadows the parameter, but that's only
temporary.
Signed-off-by: David Sterba <dsterba@suse.com>
__extent_writepage reads write flags from wbc and passes both to
__extent_writepage_io. This makes write_flags redundant and we can
remove it.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a cyclic writeback, extent_write_cache_pages() uses done_index
to update the writeback_index after the current run is over. However,
instead of current index + 1, it gets to to the current index itself.
Unfortunately, this, combined with returning on EOF instead of looping
back, can lead to the following pathlogical behavior.
1. There is a single file which has accumulated enough dirty pages to
trigger balance_dirty_pages() and the writer appending to the file
with a series of short writes.
2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
3. Writeback kicks in and the cursor is on the last page of the dirty
file. Writeback is started or skipped if already in progress. As
it's EOF, extent_write_cache_pages() returns and the cursor is set
to done_index which is pointing to the last page.
4. Writeback is done. Nothing happens till balance_dirty_pages
finishes, at which point we go back to #1.
This can almost completely stall out writing back of the file and keep
the system over dirty threshold for a long time which can mess up the
whole system. We encountered this issue in production with a package
handling application which can reliably reproduce the issue when
running under tight memory limits.
Reading the comment in the error handling section, this seems to be to
avoid accidentally skipping a page in case the write attempt on the
page doesn't succeed. However, this concern seems bogus.
On each page, the code either:
* Skips and moves onto the next page.
* Fails issue and sets done_index to index + 1.
* Successfully issues and continue to the next page if budget allows
and not EOF.
IOW, as long as it's not EOF and there's budget, the code never
retries writing back the same page. Only when a page happens to be
the last page of a particular run, we end up retrying the page, which
can't possibly guarantee anything data integrity related. Besides,
cyclic writes are only used for non-syncing writebacks meaning that
there's no data integrity implication to begin with.
Fix it by always setting done_index past the current page being
processed.
Note that this problem exists in other writepages too.
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_write_locked_range() is used when we're falling back to buffered
IO from inside of compression. It allocates its own wbc and should
associate it with the inode's i_wb to make sure the IO goes down from
the correct cgroup.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Async CRCs and compression submit IO through helper threads, which means
they have IO priority inversions when cgroup IO controllers are in use.
This flags all of the writes submitted by btrfs helper threads as
REQ_CGROUP_PUNT. submit_bio() will punt these to dedicated per-blkcg
work items to avoid the priority inversion.
For the compression code, we take a reference on the wbc's blkg css and
pass it down to the async workers.
For the async CRCs, the bio already has the correct css, we just need to
tell the block layer to use REQ_CGROUP_PUNT.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing. When compression is on, we allocate one async_chunk
structure for every 512K, and then run those pages through the
compression code for IO submission.
writepages starts all of this off with a single page, locked by the
original call to extent_write_cache_pages(), and it's important to keep
track of this page because it has already been through
clear_page_dirty_for_io().
The btrfs async_chunk struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fallback to
uncompressed IO, we use page->index to decide if a given async_chunk
struct really owns that page.
But, this is racey. If a given delalloc range is broken up into two
async_chunks (chunkA and chunkB), we can end up with something like
this:
compress_file_range(chunkA)
submit_compress_extents(chunkA)
submit compressed bios(chunkA)
put_page(locked_page)
compress_file_range(chunkB)
...
Or:
async_cow_submit
submit_compressed_extents <--- falls back to buffered writeout
cow_file_range
extent_clear_unlock_delalloc
__process_pages_contig
put_page(locked_pages)
async_cow_submit
The end result is that chunkA is completed and cleaned up before chunkB
even starts processing. This means we can free locked_page() and reuse
it elsewhere. If we get really lucky, it'll have the same page->index
in its new home as it did before.
While we're processing chunkB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobufers() on chunkB->locked_page.
Without cgroups in use, this creates as a phantom dirty page, which
isn't great but isn't the end of the world. What can happen, it can go
through the fixup worker and the whole COW machinery again:
in submit_compressed_extents():
while (async extents) {
...
cow_file_range
if (!page_started ...)
extent_write_locked_range
else if (...)
unlock_page
continue;
This hasn't been observed in practice but is still possible.
With cgroups in use, we might crash in the accounting code because
page->mapping->i_wb isn't set.
BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
IP: percpu_counter_add_batch+0x11/0x70
PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 16 PID: 2172 Comm: rm Not tainted
RIP: 0010:percpu_counter_add_batch+0x11/0x70
RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
account_page_cleaned+0x15b/0x1f0
__cancel_dirty_page+0x146/0x200
truncate_cleanup_page+0x92/0xb0
truncate_inode_pages_range+0x202/0x7d0
btrfs_evict_inode+0x92/0x5a0
evict+0xc1/0x190
do_unlinkat+0x176/0x280
do_syscall_64+0x63/0x1a0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
The fix here is to make asyc_chunk->locked_page NULL everywhere but the
one async_chunk struct that's allowed to do things to the locked page.
Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
Fixes: 771ed689d2 ("Btrfs: Optimize compressed writeback and reads")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
[ update changelog from mail thread discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
This needs to be cleaned up in the future, but for now it belongs to the
extent-io-tree stuff since it uses the internal tree search code.
Needed to export get_state_failrec and set_state_failrec as well since
we're not going to move the actual IO part of the failrec stuff out at
this point.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This utilizes internal stuff to the extent_io_tree, so we need to export
it before we move it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_io.c/h are huge, encompassing a bunch of different things. The
extent_io_tree code can live on its own, so separate this out.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are moving extent_io_tree into it's on file, so separate out the
extent_state init stuff from extent_io_tree_init().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We check both extent buffer and extent state leaks in the same function,
separate these two functions out so we can move them around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2SDbMACgkQxWXV+ddt
WDsUhw/9HcRsT6SlrwA2R5leHxCR5UMwT2Zmbxpfft37ANF0SC1UINHBfnmquM97
xX6fdRSR9RUjF9DrdLPfLBnJDQ/MnHl1ruIVBFhJm6cJ9TJwf9E0TiJBQt+08JWg
vy5hZBWvsPWWRBJ94XPMe4LtakK/isW4Cz5W9AdrC2Siqw69j6eZzms2AnIjyBjA
BoKg4se2Ay2rMxLZWXIOj9374PU+N1cnRnqgh77ZxLku5WdCzrDfB5safE7UmoTG
/MWJuuIgzOk0iQpQORRtEZDS1dNe5KT9m4xXkUbrZbQROwqnXrT1SVIsuqNAvlPk
uaymR1W8nshepzpMlSxVydLv/mKWZNUGnDxOJ23ooow8Yd7ndppXEtFuGwCYqIFc
xQqxuTLREvJ9+jpSv11bmDpk/ULRqpV+2PjUqGaWlGwFArJ+qFRLVGYx31eXmDPj
t2mrPOcXGzY0pKtIpbkuUGleY/jeI+BNsvD4+QPs+jnp0nmfvH0/Rmp7grGqx2FI
rQM8Gn4a5i3nuEDWLp8nN2wcKC3ePwy96Vp2tqfsl6TVTPx4EFzGLkWogHR2yiqI
0LAj8YWFmWuChSv71wYOjX79CppjcbNwOakSwtDjV30jkwoh2f/0D3OwOpua2xe8
75KQMaSB0kesGZz7ZkL1kMqA5m5w7MGZom6XZoBJ+bq2HPLB2jo=
=2UM7
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A bunch of fixes that accumulated in recent weeks, mostly material for
stable.
Summary:
- fix for regression from 5.3 that prevents to use balance convert
with single profile
- qgroup fixes: rescan race, accounting leak with multiple writers,
potential leak after io failure recovery
- fix for use after free in relocation (reported by KASAN)
- other error handling fixups"
* tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls
btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
btrfs: Fix a regression which we can't convert to SINGLE profile
btrfs: relocation: fix use-after-free on dead relocation roots
Btrfs: fix race setting up and completing qgroup rescan workers
Btrfs: fix missing error return if writeback for extent buffer never started
btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer
Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
If lock_extent_buffer_for_io() fails, it returns a negative value, but its
caller btree_write_cache_pages() ignores such error. This means that a
call to flush_write_bio(), from lock_extent_buffer_for_io(), might have
failed. We should make btree_write_cache_pages() notice such error values
and stop immediatelly, making sure filemap_fdatawrite_range() returns an
error to the transaction commit path. A failure from flush_write_bio()
should also result in the endio callback end_bio_extent_buffer_writepage()
being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that
there's no risk a transaction or log commit doesn't catch a writeback
failure.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before, if a eb failed to write out, we would end up triggering a
BUG_ON(). As of f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in
flush_write_bio() one level up"), we no longer BUG_ON(), so we should
make life consistent and add back the unwritten bytes to
dirty_metadata_bytes.
Fixes: f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Filipe Manana <fdmanana@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl1/hCoACgkQxWXV+ddt
WDs0lQ//flGLX4fvaY2vuWA26t1elITnIatyX8S+xP4pUsT1Tyy1egeGpR8Jku/7
sCOgUlEM2MNXqveOdkQqPJuFPp3B6tInz4S/fowtLlz4enp7uTXw2SFuS3bhOJ+b
rpxK9VTc6QV3aipBCG31m8fnDiMaj2Hcspp0oej3V2mBhLUvzn69+P4eo7WN+46w
r2F605+lfURauHE6WjM09HINx3NGSfPqdSA5rJvHSm0jlxhb9l3DJOX8cYkbf8lo
MAbLDZmtiDiQAqRcsQPi6LZ1LKBkOYaeSnVvnXnH23FI04LBra3duk03qpvWCW2R
c1tFnKF5vACCyBQp1z8WYP9GjjoW5WT33R2iXufgwXP6pkLpS/12qLLeXqO2K4p5
zINKrIkF3P+GHxiDsQZE3G9A4UpKWFHCxKdxyWIV8LQDEBrgE2Mo3NThEyRBbP+8
1dia4j+qFHvPTMNBvBCjCZMqDwbCe9H70WOXKGE36JITW2le91mn4qHl4SuWReUP
IoHYDVcC/eBGRegc9X+bLJNjJYqo+XFo6u32/fUC5YVhngycQEi2vg1vv8fWQ7dB
g/Ruo3Inrk8h5kPmrHvbOzGazgANIt5ELHrYMRMA5WSgaq29jtGt9oTnsrd+I88G
aPJtwAZfLwdSjl/pwJw8atEPrf04DA2w+gO7rZ/AmeLshnGfOTc=
=bY+a
-----END PGP SIGNATURE-----
Merge tag 'for-5.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This continues with work on code refactoring, sanity checks and space
handling. There are some less user visible changes, nothing that would
particularly stand out.
User visible changes:
- tree checker, more sanity checks of:
- ROOT_ITEM (key, size, generation, level, alignment, flags)
- EXTENT_ITEM and METADATA_ITEM checks (key, size, offset,
alignment, refs)
- tree block reference items
- EXTENT_DATA_REF (key, hash, offset)
- deprecate flag BTRFS_SUBVOL_CREATE_ASYNC for subvolume creation
ioctl, scheduled removal in 5.7
- delete stale and unused UAPI definitions
BTRFS_DEV_REPLACE_ITEM_STATE_*
- improved export of debugging information available via existing
sysfs directory structure
- try harder to delete relations between qgroups and allow to delete
orphan entries
- remove unreliable space checks before relocation starts
Core:
- space handling:
- improved ticket reservations and other high level logic in
order to remove special cases
- factor flushing infrastructure and use it for different
contexts, allows to remove some special case handling
- reduce metadata reservation when only updating inodes
- reduce global block reserve minimum size (affects small
filesystems)
- improved overcommit logic wrt global block reserve
- tests:
- fix memory leaks in extent IO tree
- catch all TRIM range
Fixes:
- fix ENOSPC errors, leading to transaction aborts, when cloning
extents
- several fixes for inode number cache (mount option inode_cache)
- fix potential soft lockups during send when traversing large trees
- fix unaligned access to space cache pages with SLUB debug on
(PowerPC)
Other:
- refactoring public/private functions, moving to new or more
appropriate files
- defines converted to enums
- error handling improvements
- more assertions and comments
- old code deletion"
* tag 'for-5.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (138 commits)
btrfs: Relinquish CPUs in btrfs_compare_trees
btrfs: Don't assign retval of btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic
btrfs: create structure to encode checksum type and length
btrfs: turn checksum type define into an enum
btrfs: add enospc debug messages for ticket failure
btrfs: do not account global reserve in can_overcommit
btrfs: use btrfs_try_granting_tickets in update_global_rsv
btrfs: always reserve our entire size for the global reserve
btrfs: change the minimum global reserve size
btrfs: rename btrfs_space_info_add_old_bytes
btrfs: remove orig_bytes from reserve_ticket
btrfs: fix may_commit_transaction to deal with no partial filling
btrfs: rework wake_all_tickets
btrfs: refactor the ticket wakeup code
btrfs: stop partially refilling tickets when releasing space
btrfs: add space reservation tracepoint for reserved bytes
btrfs: roll tracepoint into btrfs_space_info_update helper
btrfs: do not allow reservations if we have pending tickets
btrfs: stop clearing EXTENT_DIRTY in inode I/O tree
btrfs: treat RWF_{,D}SYNC writes as sync for CRCs
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl16aTEACgkQxWXV+ddt
WDvICg//cSn5w+g6EnxbrAZ6IYQJ4GA7lZSk2i6Dc/lI3BTfs7Wj0SPRKd01pBjT
N+wbqoOgubsS1jkNfJsGCN80XzSa0tvyQdbezj5ncgSPXp4FRlT0K24EUQNPaqbg
SsvvxAOCerVN3Yj2qrHNWIS5qZ5/8/NjLXca1DJ/OYmrkKfhe+Z6/b9EuKffPnco
erMnaeSvQ27hYkkcdM0DGcWDoHHAQrefGNjQzp5vncJNN1F7+EGLbcH31UwApk1K
/hvOQ6Q6SoR/NKbQu3AitrR9u7v9uhWP9jHJZT46q1m89CzI4S5FjK2wKZFjPE6r
0PGRqnpdaGAERaTo3s6jIqv/X2gzJkhhhzGMiPgPJCQbAH39f/fFGEX22TjG33Yq
2CiGSIPnmKQ7HE494YLuSyHD/89SutMMCkbF0sFBoKuTnu2HQMn9r5Pk6bEKtvIY
iTk75/WTXR02qWCVhTyNDa9QnxewQGJC1d1KNQ6MwbzBiYyG9S/DDZnjLJPNx7DF
KAAANCDdyPpraLcmw2sD/qd1o10HfQmn9z1L2v3YvJBfjMe76SQFCP5WwaJRcjOm
c3ScAX9bXeXJgH+E98kWc7T6p49IPdMDGAtArQmtjO4V8pFRuqG+2Ibg6Za/y5XZ
fkaS5UY+XIk3TUpEqkWKMPMigM9a3jgHskyMgdRLQfVnoOc6Z+k=
=KXB8
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Here are two fixes, one of them urgent fixing a bug introduced in 5.2
and reported by many users. It took time to identify the root cause,
catching the 5.3 release is higly desired also to push the fix to 5.2
stable tree.
The bug is a mess up of return values after adding proper error
handling and honestly the kind of bug that can cause sleeping
disorders until it's caught. My appologies to everybody who was
affected.
Summary of what could happen:
1) either a hang when committing a transaction, if this happens
there's no risk of corruption, still the hang is very inconvenient
and can't be resolved without a reboot
2) writeback for some btree nodes may never be started and we end up
committing a transaction without noticing that, this is really
serious and that will lead to the "parent transid verify failed"
messages"
* tag 'for-5.3-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix unwritten extent buffers and hangs on future writeback attempts
Btrfs: fix assertion failure during fsync and use of stale transaction
The lock_extent_buffer_io() returns 1 to the caller to tell it everything
went fine and the callers needs to start writeback for the extent buffer
(submit a bio, etc), 0 to tell the caller everything went fine but it does
not need to start writeback for the extent buffer, and a negative value if
some error happened.
When it's about to return 1 it tries to lock all pages, and if a try lock
on a page fails, and we didn't flush any existing bio in our "epd", it
calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or
an error. The page might have been locked elsewhere, not with the goal
of starting writeback of the extent buffer, and even by some code other
than btrfs, like page migration for example, so it does not mean the
writeback of the extent buffer was already started by some other task,
so returning a 0 tells the caller (btree_write_cache_pages()) to not
start writeback for the extent buffer. Note that epd might currently have
either no bio, so flush_write_bio() returns 0 (success) or it might have
a bio for another extent buffer with a lower index (logical address).
Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
extent buffer and writeback is never started for the extent buffer,
future attempts to writeback the extent buffer will hang forever waiting
on that bit to be cleared, since it can only be cleared after writeback
completes. Such hang is reported with a trace like the following:
[49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds.
[49887.347059] Not tainted 5.2.13-gentoo #2
[49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[49887.347062] btrfs-transacti D 0 1752 2 0x80004000
[49887.347064] Call Trace:
[49887.347069] ? __schedule+0x265/0x830
[49887.347071] ? bit_wait+0x50/0x50
[49887.347072] ? bit_wait+0x50/0x50
[49887.347074] schedule+0x24/0x90
[49887.347075] io_schedule+0x3c/0x60
[49887.347077] bit_wait_io+0x8/0x50
[49887.347079] __wait_on_bit+0x6c/0x80
[49887.347081] ? __lock_release.isra.29+0x155/0x2d0
[49887.347083] out_of_line_wait_on_bit+0x7b/0x80
[49887.347084] ? var_wake_function+0x20/0x20
[49887.347087] lock_extent_buffer_for_io+0x28c/0x390
[49887.347089] btree_write_cache_pages+0x18e/0x340
[49887.347091] do_writepages+0x29/0xb0
[49887.347093] ? kmem_cache_free+0x132/0x160
[49887.347095] ? convert_extent_bit+0x544/0x680
[49887.347097] filemap_fdatawrite_range+0x70/0x90
[49887.347099] btrfs_write_marked_extents+0x53/0x120
[49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0
[49887.347102] btrfs_commit_transaction+0x6bb/0x990
[49887.347103] ? start_transaction+0x33e/0x500
[49887.347105] transaction_kthread+0x139/0x15c
So fix this by not overwriting the return value (ret) with the result
from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK
bit in case flush_write_bio() returns an error, otherwise it will hang
any future attempts to writeback the extent buffer, and undo all work
done before (set back EXTENT_BUFFER_DIRTY, etc).
This is a regression introduced in the 5.2 kernel.
Fixes: 2e3c25136a ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()")
Fixes: f4340622e0 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
Reported-by: Zdenek Sojka <zsojka@seznam.cz>
Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u
Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t
Reported-by: Drazen Kacar <drazen.kacar@oradian.com>
Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit fee187d9d9 ("Btrfs: do not set EXTENT_DIRTY along with
EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
can simplify and stop trying to clear it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It was added in ba8b04c1d4 ("btrfs: extend btrfs_set_extent_delalloc
and its friends to support in-band dedupe and subpage size patchset") as
a preparatory patch for in-band and subapge block size patchsets.
However neither of those are likely to be merged anytime soon and the
code has diverged significantly from the last public post of either
of those patchsets.
It's unlikely either of the patchests are going to use those preparatory
steps so just remove the variables. Since cow_file_range also took
delalloc_end to pass it to extent_clear_unlock_delalloc remove the
parameter from that function as well.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl0sNWYACgkQxWXV+ddt
WDsyQA/8CGnF68g6hwVuYz4K7f39gOiFlBnRxeN/3RT6vkNSyLZxvRDaDrSTzVIo
cz2G/9qZLXsIll+3EfZlyzZZiA+4f4hEDAfAd4yVPavRom+uu7dbqzAIpgvFlYdH
vhAYKOeWSqWElWJ06hzWO3FCwjY9GKFMk4PS0XHHp+STCT0hq1MkaHr44kiHsqdh
T5nVGDwXz8nGDZ51RO6+mgiSrd5eHbs6kXCd8rW7hmjTx8ClKHa1tdkxN/us+pJm
hTFT669m5ckHhY2AUKmkREoOwpnt2HcXQJNkz6gO+o03IDvYz73SScbhSYdNTlwi
j74GLf89FA52qVM+JDg9MaWYqgf1pQI8AHK/rXw2FNbuP/eL9kuZ85ZIbO6CiO0c
5jAixReSwzSP/V0+MKW3F7k4KtIqbHAV6mkI8zLwrAee4Xj81BOtgL7gYPFQTwSZ
ma0hEoen7IV5+/z9upUuLA5wr4BT+h1T+EllCWe1+9+9mRYOvowtkRNBL8HZWTDI
b65oTITfot54xX9ecKtiuG2qoqJEjjkR+YKdRM4nph6wflSNZxEoezBp3iRFpYOL
Lx+g97RcJ2EEoBVjVMkTqfj93GeiKRifa8yXdRY+A0I2ZXZEcS8DjSJM6rj3AOPy
4idIl+ABscayZowfqu0FSIULf1La0qiRXmbGNeG4ylhN4L6S/og=
=eshk
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Highlights:
- chunks that have been trimmed and unchanged since last mount are
tracked and skipped on repeated trims
- use hw assissed crc32c on more arches, speedups if native
instructions or optimized implementation is available
- the RAID56 incompat bit is automatically removed when the last
block group of that type is removed
Fixes:
- fsync fix for reflink on NODATACOW files that could lead to ENOSPC
- fix data loss after inode eviction, renaming it, and fsync it
- fix fsync not persisting dentry deletions due to inode evictions
- update ctime/mtime/iversion after hole punching
- fix compression type validation (reported by KASAN)
- send won't be allowed to start when relocation is in progress, this
can cause spurious errors or produce incorrect send stream
Core:
- new tracepoints for space update
- tree-checker: better check for end of extents for some tree items
- preparatory work for more checksum algorithms
- run delayed iput at unlink time and don't push the work to cleaner
thread where it's not properly throttled
- wrap block mapping to structures and helpers, base for further
refactoring
- split large files, part 1:
- space info handling
- block group reservations
- delayed refs
- delayed allocation
- other cleanups and refactoring"
* tag 'for-5.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (103 commits)
btrfs: fix memory leak of path on error return path
btrfs: move the subvolume reservation stuff out of extent-tree.c
btrfs: migrate the delalloc space stuff to it's own home
btrfs: migrate btrfs_trans_release_chunk_metadata
btrfs: migrate the delayed refs rsv code
btrfs: Evaluate io_tree in find_lock_delalloc_range()
btrfs: migrate the global_block_rsv helpers to block-rsv.c
btrfs: migrate the block-rsv code to block-rsv.c
btrfs: stop using block_rsv_release_bytes everywhere
btrfs: cleanup the target logic in __btrfs_block_rsv_release
btrfs: export __btrfs_block_rsv_release
btrfs: export btrfs_block_rsv_add_bytes
btrfs: move btrfs_block_rsv definitions into it's own header
btrfs: Simplify update of space_info in __reserve_metadata_bytes()
btrfs: unexport can_overcommit
btrfs: move reserve_metadata_bytes and supporting code to space-info.c
btrfs: move dump_space_info to space-info.c
btrfs: export block_rsv_use_bytes
btrfs: move btrfs_space_info_add_*_bytes to space-info.c
btrfs: move the space info update macro to space-info.h
...
wbc_account_io() does a very specific job - try to see which cgroup is
actually dirtying an inode and transfer its ownership to the majority
dirtier if needed. The name is too generic and confusing. Let's
rename it to something more specific.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently if the allocation of roots or tmp_ulist fails the error handling
does not free up the allocation of path causing a memory leak. Fix this and
other similar leaks by moving the call of btrfs_free_path from label out
to label out_free_ulist.
Kudos to David Sterba for spotting the issue in my original fix and suggesting
the correct way to fix the leak and Anand Jain for spotting a double free
issue.
Addresses-Coverity: ("Resource leak")
Fixes: 5911c8fe05 ("btrfs: fiemap: preallocate ulists for btrfs_check_shared")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplification. No point passing the tree variable when it can be
evaluated from inode. The tests now use the io_tree from btrfs_inode as
opposed to creating one.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The block device is passed around for the only purpose to set it in new
bios. Move the assignment one level up. This is a preparatory patch for
further bdev cleanups.
Signed-off-by: David Sterba <dsterba@suse.com>
The write_locks is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The spinning_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.
Signed-off-by: David Sterba <dsterba@suse.com>
The blocking_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function has a lot of return values and specific conventions making
it cumbersome to understand what's returned. Have a go at documenting
its parameters and return values.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently find_first_clear_extent_bit always returns a range whose
starting value is >= passed 'start'. This implicit trimming behavior is
somewhat subtle and an implementation detail.
Instead, this patch modifies the function such that now it always
returns the range which contains passed 'start' and has the given bits
unset. This range could either be due to presence of existing records
which contains 'start' but have the bits unset or because there are no
records that contain the given starting offset.
This patch also adds test cases which cover find_first_clear_extent_bit
since they were missing up until now.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There several functions which open code
btrfs_lock_and_flush_ordered_range, just replace them with a call to the
function. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At the context of btrfs_run_delalloc_range(), we haven't started/joined
a transaction, thus even something went wrong, we can't and won't abort
transaction, thus no way to make the fs RO.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_check_shared looks up parents of a given extent and uses ulists
for that. These are allocated and freed repeatedly. Preallocation in the
caller will avoid the overhead and also allow us to use the GFP_KERNEL
as it is happens before the extent locks are taken.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzR0AAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpo0MD/47D1kBK9rGzkAwIz1Jkh1Qy/ITVaDJzmHJ
UP5uncQsgKFLKMR1LbRcrWtmk2MwFDNULGbteHFeCYE1ypCrTgpWSp5+SJluKd1Q
hma9krLSAXO9QiSaZ4jafshXFIZxz6IjakOW8c9LrT80Ze47yh7AxiLwDafcp/Jj
x6NW790qB7ENDtfarDkZk14NCS8HGLRHO5B21LB+hT0Kfbh0XZaLzJdj7Mck1wPA
VT8hL9mPuA++AjF7Ra4kUjwSakgmajTa3nS2fpkwTYdztQfas7x5Jiv7FWxrrelb
qbabkNkWKepcHAPEiZR7o53TyfCucGeSK/jG+dsJ9KhNp26kl1ci3frl5T6PfVMP
SPPDjsKIHs+dqFrU9y5rSGhLJqewTs96hHthnLGxyF67+5sRb5+YIy+dcqgiyc/b
TUVyjCD6r0cO2q4v9VhwnhOyeBUA9Rwbu8nl7JV5Q45uG7qI4BC39l1jfubMNDPO
GLNGUUzb6ER7z6lYINjRSF2Jhejsx8SR9P7jhpb1Q7k/VvDDxO1T4FpwvqWFz9+s
Gn+s6//+cA6LL+42eZkQjvwF2CUNE7TaVT8zdb+s5HP1RQkZToqUnsQCGeRTrFni
RqWXfW9o9+awYRp431417oMdX/LvLGq9+ZtifRk9DqDcowXevTaf0W2RpplWSuiX
RcCuPeLAVg==
=Ot0g
-----END PGP SIGNATURE-----
Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"Nothing major in this series, just fixes and improvements all over the
map. This contains:
- Series of fixes for sed-opal (David, Jonas)
- Fixes and performance tweaks for BFQ (via Paolo)
- Set of fixes for bcache (via Coly)
- Set of fixes for md (via Song)
- Enabling multi-page for passthrough requests (Ming)
- Queue release fix series (Ming)
- Device notification improvements (Martin)
- Propagate underlying device rotational status in loop (Holger)
- Removal of mtip32xx trim support, which has been disabled for years
(Christoph)
- Improvement and cleanup of nvme command handling (Christoph)
- Add block SPDX tags (Christoph)
- Cleanup/hardening of bio/bvec iteration (Christoph)
- A few NVMe pull requests (Christoph)
- Removal of CONFIG_LBDAF (Christoph)
- Various little fixes here and there"
* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
block: fix mismerge in bvec_advance
block: don't drain in-progress dispatch in blk_cleanup_queue()
blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
blk-mq: always free hctx after request queue is freed
blk-mq: split blk_mq_alloc_and_init_hctx into two parts
blk-mq: free hw queue's resource in hctx's release handler
blk-mq: move cancel of requeue_work into blk_mq_release
blk-mq: grab .q_usage_counter when queuing request from plug code path
block: fix function name in comment
nvmet: protect discovery change log event list iteration
nvme: mark nvme_core_init and nvme_core_exit static
nvme: move command size checks to the core
nvme-fabrics: check more command sizes
nvme-pci: check more command sizes
nvme-pci: remove an unneeded variable initialization
nvme-pci: unquiesce admin queue on shutdown
nvme-pci: shutdown on timeout during deletion
nvme-pci: fix psdt field for single segment sgls
nvme-multipath: don't print ANA group state by default
nvme-multipath: split bios with the ns_head bio_set before submitting
...
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
None of the implementers of the submit_bio_hook use the bio_offset
parameter, simply remove it. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always uses the btree inode's io_tree. Stop taking the
tree as a function argument and instead access it internally from
read_extent_buffer_pages. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is very similar to find_first_extent_bit except that it
locates the first contiguous span of space which does not have bits set.
It's intended use is in the freespace trimming code.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It will be used in a future patch that will require modifying an
extent_io_tree struct under a spinlock.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is going to be used to clear out the device extent
allocation information. Give it a more generic name and export it. This
is in preparation to replacing the pending/pinned chunk lists with an
extent tree. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can read fs_info from extent buffer and can drop it from the
parameters. As all callsites are updated, add the btrfs_ prefix as the
function is exported.
Signed-off-by: David Sterba <dsterba@suse.com>
BUG_ON(1) leads to bogus warnings from clang when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set:
fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false
[-Werror,-Wsometimes-uninitialized]
BUG_ON(1);
^~~~~~~~~
include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
# define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here
max_chunk_size);
^~~~~~~~~~~~~~
include/linux/kernel.h:860:36: note: expanded from macro 'min'
#define min(x, y) __careful_cmp(x, y, <)
^
include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp'
__cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
^
include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once'
typeof(y) unique_y = (y); \
^
fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true
BUG_ON(1);
^
include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^
fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning
u64 max_chunk_size;
^
= 0
Change it to BUG() so clang can see that this code path can never
continue.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Long time ago (2008), the extent buffers were organized in a LRU list
and switched to rb-tree in 6af118ce51 ("Btrfs: Index extent
buffers in an rbtree"). There was one stale macro definition left.
Signed-off-by: David Sterba <dsterba@suse.com>
We can only get <=0 from extent_write_cache_pages, add an ASSERT() for
it just in case.
Then instead of submitting the write bio even if we got some error,
check the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.
If there is no error so far, then call flush_write_bio() and return the
result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function needs some extra checks on locked pages and eb. For error
handling we need to unlock locked pages and the eb.
There is a rare >0 return value branch, where all pages get locked
while write bio is not flushed.
Thankfully it's handled by the only caller, btree_write_cache_pages(),
as later write_one_eb() call will trigger submit_one_bio(). So there
shouldn't be any problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can only get @ret <= 0. Add an ASSERT() for it just in case.
Then, instead of submitting the write bio even we got some error, check
the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.
If there is no error so far, then call flush_write_bio() and return the
result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since __extent_writepage() will no longer return >0 value,
(ret == AOP_WRITEPAGE_ACTIVATE) will never be true.
Kill that dead branch.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btree_write_cache_pages(), we can only get @ret <= 0.
Add an ASSERT() for it just in case.
Then instead of submitting the write bio even we got some error, check
the return value first.
If we have already hit some error, just clean up the corrupted or
half-baked bio, and return error.
If there is no error so far, then call flush_write_bio() and return the
result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since now flush_write_bio() could return error, kill the BUG_ON() first.
Then don't call flush_write_bio() unconditionally, instead we check the
return value from __extent_writepage() first.
If __extent_writepage() fails, we do cleanup, and return error without
submitting the possible corrupted or half-baked bio.
If __extent_writepage() successes, then we call flush_write_bio() and
return the result.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a BUG_ON() in flush_write_bio() to handle the return value of
submit_one_bio().
Move the BUG_ON() one level up to all its callers.
This patch will introduce temporary variable, @flush_ret to keep code
change minimal in this patch. That variable will be cleaned up when
enhancing the error handling later.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case we hit the error case for a metadata buffer in
end_bio_extent_readpage then 'ret' won't really be checked before it's
written again to. This means the -EIO in this case will never be
checked, just remove it.
Fixes-coverity-id: 1442513
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently extent_readpages (called from btrfs_readpages) will always
call __extent_readpages which tries to create contiguous range of pages
and call __do_contiguous_readpages when such contiguous range is
created.
It turns out this is unnecessary due to the fact that generic MM code
always calls filesystem's ->readpages callback (btrfs_readpages in
this case) with already contiguous pages. Armed with this knowledge it's
possible to simplify extent_readpages by eliminating the call to
__extent_readpages and directly calling contiguous_readpages.
The only edge case that needs to be handled is when
add_to_page_cache_lru fails. This is easy as all that is needed is to
submit whatever is the number of pages successfully added to the lru.
This can happen when the page is already in the range, so it does not
need to be read again, and we can't do anything else in case of other
errors.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member is tracking simple status of the lock, we can use bool for
that and make some room for further space reduction in the structure.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::write_locks become unused and can be
moved to the appropriate section, saving a few bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::read_locks become unused and can be
moved to the appropriate section, saving a few bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::spining_readers become unused and can be
moved to the appropriate section, saving a few bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the helpers where open coded. On non-debug builds, the warnings will
not trigger and extent_buffer::spining_writers become unused and can be
moved to the appropriate section, saving a few bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag just became synonymous to EXTENT_LOCKED, so just remove it and
used EXTENT_LOCKED directly. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This flag was introduced in a52d9a8033 ("Btrfs: Extent based page
cache code.") and subsequently it's usage effectively was removed by
1edbb734b4 ("Btrfs: reduce CPU usage in the extent_state tree") and
f2a97a9dbd ("btrfs: remove all unused functions"). Just remove it,
no functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although btrfs heavily relies on extent_io_tree, we don't really have
any good trace events for them.
This patch will add the folowing trace events:
- trace_btrfs_set_extent_bit()
- trace_btrfs_clear_extent_bit()
- trace_btrfs_convert_extent_bit()
Since selftests could create temporary extent_io_tree without fs_info,
modify TP_fast_assign_fsid() to accept NULL as fs_info. NULL fs_info
will lead to all zero fsid.
The output would be:
btrfs_set_extent_bit: <FDID>: io_tree=INODE_IO ino=1 root=1 start=22036480 len=4096 set_bits=LOCKED
btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22040576 len=4096 set_bits=LOCKED
btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22044672 len=4096 set_bits=LOCKED
btrfs_set_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22048768 len=4096 set_bits=LOCKED
btrfs_clear_extent_bit: <FSID>: io_tree=INODE_IO ino=1 root=1 start=22036480 len=16384 clear_bits=LOCKED
^^^ Extent buffer 22036480 read from disk, the locking progress
btrfs_set_extent_bit: <FSID>: io_tree=TRANS_DIRTY_PAGES ino=1 root=1 start=30425088 len=16384 set_bits=DIRTY
btrfs_set_extent_bit: <FSID>: io_tree=TRANS_DIRTY_PAGES ino=1 root=1 start=30441472 len=16384 set_bits=DIRTY
^^^ 2 new tree blocks allocated in one transaction
btrfs_set_extent_bit: <FSID>: io_tree=FREED_EXTENTS0 ino=0 root=0 start=30523392 len=16384 set_bits=DIRTY
btrfs_set_extent_bit: <FSID>: io_tree=FREED_EXTENTS0 ino=0 root=0 start=30556160 len=16384 set_bits=DIRTY
^^^ 2 old tree blocks get pinned down
There is one point which need attention:
1) Those trace events can be pretty heavy:
The following workload would generate over 400 trace events.
mkfs.btrfs -f $dev
start_trace
mount $dev $mnt -o enospc_debug
sync
touch $mnt/file1
touch $mnt/file2
touch $mnt/file3
xfs_io -f -c "pwrite 0 16k" $mnt/file4
umount $mnt
end_trace
It's not recommended to use them in real world environment.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename enums ]
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs has the following different extent_io_trees used:
- fs_info::free_extents[2]
- btrfs_inode::io_tree - for both normal inodes and the btree inode
- btrfs_inode::io_failure_tree
- btrfs_transaction::dirty_pages
- btrfs_root::dirty_log_pages
If we want to trace changes in those trees, it will be pretty hard to
distinguish them.
Instead of using hard-to-read pointer address, this patch will introduce
a new member extent_io_tree::owner to track the owner.
This modification needs all the callers of extent_io_tree_init() to
accept a new parameter @owner.
This patch provides the basis for later trace events.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will add a new member fs_info to extent_io_tree.
This provides the basis for later trace events to distinguish the output
between different btrfs filesystems. While this increases the size of
the structure, we want to know the source of the trace events and
passing the fs_info as an argument to all contexts is not possible.
The selftests are now allowed to set it to NULL as they don't use the
tracepoints.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlyHwgcACgkQxWXV+ddt
WDvfFQ//VBO8Rnz0+V4XbxaIaz25EbCjR4cBuXwwXyl8HPJMZBvBPqW0LVXtV0eP
SwK0A5qiWqaXgWNpByD43AvgizZuWF/9SvxebaCKTjSK5t9TuXpR27vJNnHJf0L0
o4DeMXlgd8yE8yZstQo7UnLWfNU69v6Pi3Zbar/7IIJ0sVVCPMMoGoARZDlQ+w0M
wwppi04+a6bnAUbqpnWiL0a8++WX6gqP7MovqLRgf/up4cmzmDFoV7b/7pbvZxNv
LrKQBmJZQq44bW4TXMzhpkrIGyzrrUQuBhpbYJus9yZYqS6Owkzl5AQpdzo9reg2
V35xOkOZbXxqdOTGY0he9Z6wxJL+ocfryfRyA2hE4gXbCAnfFqIRyFicpTXuXxwg
RBan8VLB+1iC7j9djX/sP/uCH3tsPgN4WnjdZgnkUOkUhTuvpPw/A9bp6Uqfjr6g
JU0o/TlCC8npaveUQsuNbqyVYgPk58d9by12HsSW7UaA8ENyHz62+zoiv9jX/uZY
Tl4t2L+MKxEcsd0KEKEQpV+0hV56GtYcIZIqJTe9WFmPBHmEH3PCHDx4A5LrYveO
hC+hGAnX9xWK4XIr8T3ck1tsnxNApD25pmKSivadUiVJqOrPpJFyZb3aztcKcx4Y
sDbZdOV7XHq6ACrIhLoxpYWQc27v1FqrWVqsF51wo07I3meUVaA=
=u4Kf
-----END PGP SIGNATURE-----
Merge tag 'for-5.1-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Correctness and a deadlock fixes"
* tag 'for-5.1-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zstd: ensure reclaim timer is properly cleaned up
btrfs: move ulist allocation out of transaction in quota enable
btrfs: save drop_progress if we drop refs at all
btrfs: check for refs on snapshot delete resume
Btrfs: fix deadlock between clone/dedupe and rename
Btrfs: fix corruption reading shared and compressed extents after hole punching
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlx63XIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpp2vEACfrrQsap7R+Av28mmXpmXi2FPa3g5Tev1t
yYjK2qHvhlMZjPTYw3hCmbYdDDczlF7PEgSE2x2DjdcsYapb8Fy1lZ2X16c7ztBR
HD/t9b5AVSQsczZzKgv3RqsNtTnjzS5V0A8XH8FAP2QRgiwDMwSN6G0FP0JBLbE/
ZgxQrH1Iy1F33Wz4hI3Z7dEghKPZrH1IlegkZCEu47q9SlWS76qUetSy2GEtchOl
3Lgu54mQZyVdI5/QZf9DyMDLF6dIz3tYU2qhuo01AHjGRCC72v86p8sIiXcUr94Q
8pbegJhJ/g8KBol9Qhv3+pWG/QUAZwi/ZwasTkK+MJ4klRXfOrznxPubW1z6t9Vn
QRo39Po5SqqP0QWAscDxCFjESIQlWlKa+LZurJL7DJDCUGrSgzTpnVwFqKwc5zTP
HJa5MT2tEeL2TfUYRYCfh0ZV0elINdHA1y1klDBh38drh4EWr2gW8xdseGYXqRjh
fLgEpoF7VQ8kTvxKN+E4jZXkcZmoLmefp0ZyAbblS6IawpPVC7kXM9Fdn2OU8f2c
fjVjvSiqxfeN6dnpfeLDRbbN9894HwgP/LPropJOQ7KmjCorQq5zMDkAvoh3tElq
qwluRqdBJpWT/F05KweY+XVW8OawIycmUWqt6JrVNoIDAK31auHQv47kR0VA4OvE
DRVVhYpocw==
=VBaU
-----END PGP SIGNATURE-----
Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"Not a huge amount of changes in this round, the biggest one is that we
finally have Mings multi-page bvec support merged. Apart from that,
this pull request contains:
- Small series that avoids quiescing the queue for sysfs changes that
match what we currently have (Aleksei)
- Series of bcache fixes (via Coly)
- Series of lightnvm fixes (via Mathias)
- NVMe pull request from Christoph. Nothing major, just SPDX/license
cleanups, RR mp policy (Hannes), and little fixes (Bart,
Chaitanya).
- BFQ series (Paolo)
- Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
for the fast path (Jianchao)
- fops->iopoll() added for async IO polling, this is a feature that
the upcoming io_uring interface will use (Christoph, me)
- Partition scan loop fixes (Dongli)
- mtip32xx conversion from managed resource API (Christoph)
- cdrom registration race fix (Guenter)
- MD pull from Song, two minor fixes.
- Various documentation fixes (Marcos)
- Multi-page bvec feature. This brings a lot of nice improvements
with it, like more efficient splitting, larger IOs can be supported
without growing the bvec table size, and so on. (Ming)
- Various little fixes to core and drivers"
* tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
block: fix updating bio's front segment size
block: Replace function name in string with __func__
nbd: propagate genlmsg_reply return code
floppy: remove set but not used variable 'q'
null_blk: fix checking for REQ_FUA
block: fix NULL pointer dereference in register_disk
fs: fix guard_bio_eod to check for real EOD errors
blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
block: optimize bvec iteration in bvec_iter_advance
block: introduce mp_bvec_for_each_page() for iterating over page
block: optimize blk_bio_segment_split for single-page bvec
block: optimize __blk_segment_map_sg() for single-page bvec
block: introduce bvec_nth_page()
iomap: wire up the iopoll method
block: add bio_set_polled() helper
block: wire up block device iopoll method
fs: add an iopoll method to struct file_operations
loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
loop: do not print warn message if partition scan is successful
block: bounce: make sure that bvec table is updated
...
In the past we had data corruption when reading compressed extents that
are shared within the same file and they are consecutive, this got fixed
by commit 005efedf2c ("Btrfs: fix read corruption of compressed and
shared extents") and by commit 808f80b467 ("Btrfs: update fix for read
corruption of compressed and shared extents"). However there was a case
that was missing in those fixes, which is when the shared and compressed
extents are referenced with a non-zero offset. The following shell script
creates a reproducer for this issue:
#!/bin/bash
mkfs.btrfs -f /dev/sdc &> /dev/null
mount -o compress /dev/sdc /mnt/sdc
# Create a file with 3 consecutive compressed extents, each has an
# uncompressed size of 128Kb and a compressed size of 4Kb.
for ((i = 1; i <= 3; i++)); do
head -c 4096 /dev/zero
for ((j = 1; j <= 31; j++)); do
head -c 4096 /dev/zero | tr '\0' "\377"
done
done > /mnt/sdc/foobar
sync
echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)"
# Clone the first extent into offsets 128K and 256K.
xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
sync
echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)"
# Punch holes into the regions that are already full of zeroes.
xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
sync
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
echo "Dropping page cache..."
sysctl -q vm.drop_caches=1
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
umount /dev/sdc
When running the script we get the following output:
Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
linked 131072/131072 bytes at offset 131072
128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
linked 131072/131072 bytes at offset 262144
128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Dropping page cache...
Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar
This happens because after reading all the pages of the extent in the
range from 128K to 256K for example, we read the hole at offset 256K
and then when reading the page at offset 260K we don't submit the
existing bio, which is responsible for filling all the page in the
range 128K to 256K only, therefore adding the pages from range 260K
to 384K to the existing bio and submitting it after iterating over the
entire range. Once the bio completes, the uncompressed data fills only
the pages in the range 128K to 256K because there's no more data read
from disk, leaving the pages in the range 260K to 384K unfilled. It is
just a slightly different variant of what was solved by commit
005efedf2c ("Btrfs: fix read corruption of compressed and shared
extents").
Fix this by forcing a bio submit, during readpages(), whenever we find a
compressed extent map for a page that is different from the extent map
for the previous page or has a different starting offset (in case it's
the same compressed extent), instead of the extent map's original start
offset.
A test case for fstests follows soon.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 808f80b467 ("Btrfs: update fix for read corruption of compressed and shared extents")
Fixes: 005efedf2c ("Btrfs: fix read corruption of compressed and shared extents")
Cc: stable@vger.kernel.org # 4.3+
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to forward declare flush_write_bio(), as it only
depends on submit_one_bio(). Both of them are pretty small, just move
them to kill the forward declaration.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variables and function parameters of __etree_search which pertain to
prev/next are grossly misnamed. Namely, prev_ret holds the next state
and not the previous. Similarly, next_ret actually holds the previous
extent state relating to the offset we are interested in. Fix this by
renaming the variables as well as switching the arguments order. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the refactoring introduced in 8b62f87bad ("Btrfs: reworki
outstanding_extents") this flag became unused. Remove it and renumber
the following flags accordingly. No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is a simple wrapper over btrfs_get_extent that returns
either:
a) A real extent in the passed range or
b) Adjusted extent based on whether delalloc bytes are found backing up
a hole.
To support these semantics it doesn't need the page/pg_offset/create
arguments which are passed to btrfs_get_extent in case an extent is to
be created. So simplify the function by removing the unused arguments.
No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.
Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O. But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying. Use bi_size for that and shift by
PAGE_SHIFT. This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Multiple filesystems open code lru_to_page(). Rectify this by moving
the macro from mm_inline (which is specific to lru stuff) to the more
generic mm.h header and start using the macro where appropriate.
No functional changes.
Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com
Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Acked-by: "Yan, Zheng" <zyan@redhat.com> [ceph]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The typos accumulate over time so once in a while time they get fixed in
a large patch.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
extent_readpages processes all pages in the readlist in batches of 16,
this is implemented by a single for loop but thanks to an if condition
the loop does 2 things based on whether we've filled the batch or not.
Additionally due to the structure of the code there is an additional
check which deals with partial batches.
Streamline all of this by explicitly using two loops. The outter one is
used to process all pages while the inner one just fills in the batch
of 16 (currently). Due to this new structure the code guarantees that
all pages are processed in the loop hence the code to deal with any
leftovers is eliminated.
This also enable the compiler to inline __extent_readpages:
./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for
add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160)
Function old new delta
extent_readpages 476 1136 +660
__extent_readpages 820 - -820
Total: Before=44315, After=44155, chg -0.36%
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.
So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @found is always false when it comes to the if branch. Besides, the
bool type is more suitable for @found. Change the return value of the
function and its caller to bool as well.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropraite name
- btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
used to just remove it. Thirdly, pages are submitted to either btree or
data inodes so it's guaranteed that tree->ops is set so replace the
check with an ASSERT. Finally, document the parameters of the function.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In map_private_extent_buffer() the 'offset' variable is initialized to a
page aligned version of the 'start' parameter.
But later on it is overwritten with either the offset from the extent
buffer's start or 0.
So get rid of the initial initialization.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
For data inodes this hook does nothing but to return -EAGAIN which is
used to signal to the endio routines that this bio belongs to a data
inode. If this is the case the actual retrying is handled by
bio_readpage_error. Alternatively, if this bio belongs to the btree
inode then btree_io_failed_hook just does some cleanup and doesn't retry
anything.
This patch simplifies the code flow by eliminating
readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
end_bio_extent_readpage. Also eliminate some needless checks since IO is
always performed on either data inode or btree inode, both of which are
guaranteed to have their extent_io_tree::ops set.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The end_io callback implemented as btrfs_io_bio_endio_readpage only
calls kfree. Also the callback is set only in case the csum buffer is
allocated and not pointing to the inline buffer. We can use that
information to drop the indirection and call a helper that will free the
csums only in the right case.
This shrinks struct btrfs_io_bio by 8 bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Several functions in BTRFS are only used inside the source file they are
declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared
with the unit tests code.
Before the introduction of the EXPORT_FOR_TESTS macro, these functions
could not be declared as static and the compiler had a harder task when
optimizing and inlining them.
As we have EXPORT_FOR_TESTS now, use it where appropriate to support the
compiler.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
lock_delalloc_pages should only return 2 values - 0 in case of success
and -EAGAIN if the range of pages to be locked should be shrunk due to
some of gone. Manual inspections confirms that this is indeed the case
since __process_pages_contig is where lock_delalloc_pages gets its
return value. The latter always returns 0 or -EAGAIN so the invariant
holds. No functional changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's
reduce the argument count and make that a local variable. No functional
changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for
snapshot flusher as we do in the generic write_cache_pages(), so we can
omit pages dirtied after the snapshot command.
This does not change the semantics regarding which data get to the
snapshot, if there are pages being dirtied during the snapshotting
operation. There's a sync called before snapshot is taken in old/new
case, any IO in flight just after that may be in the snapshot but this
depends on other system effects that might still sync the IO.
We do a simple snapshot speed test on a Intel D-1531 box:
fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
original: 1m58sec
patched: 6.54sec
This is the best case for this patch since for a sequential write case,
we omit nearly all pages dirtied after the snapshot command.
For a multi writers, random write test:
fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
original: 15.83sec
patched: 10.35sec
The improvement is smaller compared to the sequential write case,
since we omit only half of the pages dirtied after snapshot command.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This parameter was never used, yet was part of the interface of the
function ever since its introduction as extent_io_ops::writepage_end_io_hook
in e6dcd2dc9c ("Btrfs: New data=ordered implementation"). Now that
NULL is passed everywhere as a value for this parameter let's remove it
for good. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only remaining use of the 'epd' argument in writepage_delalloc is
to reference the extent_io_tree which was set in extent_writepages. Since
it is guaranteed that page->mapping of any page passed to
writepage_delalloc (and __extent_writepage as the sole caller) to be
equal to that passed in extent_writepages we can directly get the
io_tree via the already passed inode (which is also taken from
page->mapping->host). No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If epd::extent_locked is set then writepage_delalloc terminates. Make
this a bit more apparent in the caller by simply bubbling the check up.
This enables to remove epd as an argument to writepage_delalloc in a
future patch. No functional change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The loop construct in free_extent_buffer was added in
242e18c7c1 ("Btrfs: reduce lock contention on extent buffer locks")
as means of reducing the times the eb lock is taken, the non-last ref
count is decremented and lock is released. As the special handling
of UNMAPPED extent buffers was removed now there is only one decrement
op which is happening for EXTENT_BUFFER_UNMAPPED case.
This commit modifies the loop condition so that in case of UNMAPPED
buffers the eb's lock is taken only if we are 100% sure the eb is going
to be freed by the current executor of the code. Additionally, remove
superfluous ref count ops in btrfs test.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the whole of btrfs code has been audited for eb reference count
management it's time to remove the hunk in free_extent_buffer that
essentially considered the condition
"eb->ref == 2 && EXTENT_BUFFER_DUMMY"
to equal "eb->ref = 1". Also remove the last location
which takes an extra reference count in alloc_test_extent_buffer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the counterpart to merge_extent_hook, similarly, it's used only
for data/freespace inodes so let's remove it, rename it and call it
directly where necessary. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is used only for data and free space inodes. Such inodes
are guaranteed to have their extent_io_tree::private_data set to the
inode struct. Exploit this fact to directly call the function. Also give
it a more descriptive name. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent),
similar to what was done before remove clear_bit_hook and rename the
function. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is used to properly account delalloc extents for data
inodes (ordinary file inodes and freespace v1 inodes). Those can be
easily identified since they have their extent_io trees ->private_data
member point to the inode. Let's exploit this fact to remove the
needless indirection through extent_io_hooks and directly call the
function. Also give the function a name which reflects its purpose -
btrfs_set_delalloc_extent.
This patch also modified test_find_delalloc so that the extent_io_tree
used for testing doesn't have its ->private_data set which would have
caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root
member not being initialised. The old version of the code also didn't
call set_bit_hook since the extent_io ops weren't set for the inode. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback was only used in debug builds by btrfs_leak_debug_check.
A better approach is to move its implementation in
btrfs_leak_debug_check and ensure the latter is only executed for extent
tree which have ->private_data set i.e. relate to a data node and not
the btree one. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is ony ever called for data page writeout so there is no
need to actually abstract it via extent_io_ops. Lets just export it,
remove the definition of the callback and call it directly in the
functions that invoke the callback. Also rename the function to
btrfs_writepage_endio_finish_ordered since what it really does is
account finished io in the ordered extent data structures. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This hook is called only from __extent_writepage_io which is already
called only from the data page writeout path. So there is no need to
make an indirect call via extent_io_ops. This patch just removes the
callback definition, exports the callback function and calls it directly
at the only call site. Also give the function a more descriptive name.
No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is called only from writepage_delalloc which in turn is
guaranteed to be called from the data page writeout path. In the end
there is no reason to have the call to this function to be indrected via
the extent_io_ops structure. This patch removes the callback definition,
exports the function and calls it directly. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_run_delalloc_range ]
Signed-off-by: David Sterba <dsterba@suse.com>
Pull XArray conversion from Matthew Wilcox:
"The XArray provides an improved interface to the radix tree data
structure, providing locking as part of the API, specifying GFP flags
at allocation time, eliminating preloading, less re-walking the tree,
more efficient iterations and not exposing RCU-protected pointers to
its users.
This patch set
1. Introduces the XArray implementation
2. Converts the pagecache to use it
3. Converts memremap to use it
The page cache is the most complex and important user of the radix
tree, so converting it was most important. Converting the memremap
code removes the only other user of the multiorder code, which allows
us to remove the radix tree code that supported it.
I have 40+ followup patches to convert many other users of the radix
tree over to the XArray, but I'd like to get this part in first. The
other conversions haven't been in linux-next and aren't suitable for
applying yet, but you can see them in the xarray-conv branch if you're
interested"
* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
radix tree: Remove multiorder support
radix tree test: Convert multiorder tests to XArray
radix tree tests: Convert item_delete_rcu to XArray
radix tree tests: Convert item_kill_tree to XArray
radix tree tests: Move item_insert_order
radix tree test suite: Remove multiorder benchmarking
radix tree test suite: Remove __item_insert
memremap: Convert to XArray
xarray: Add range store functionality
xarray: Move multiorder_check to in-kernel tests
xarray: Move multiorder_shrink to kernel tests
xarray: Move multiorder account test in-kernel
radix tree test suite: Convert iteration test to XArray
radix tree test suite: Convert tag_tagged_items to XArray
radix tree: Remove radix_tree_clear_tags
radix tree: Remove radix_tree_maybe_preload_order
radix tree: Remove split/join code
radix tree: Remove radix_tree_update_node_t
page cache: Finish XArray conversion
dax: Convert page fault handlers to XArray
...
The helper find_lock_delalloc_range is now conditionally built static,
dpending on whether the self-tests are enabled or not. There's a macro
that is supposed to hide the export, used only once. To discourage
further use, drop it an add a public wrapper for the helper needed by
tests.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
are dirty, so no need to set pages dirty again.
Ftrace showed that the loop took 10us on my dev box, so removing this
can save us at least 10us if eb is already dirty and otherwise avoid a
potentially expensive calls to set_page_dirty.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just in case that someone breaks the rule that pages are dirty as long
as eb is dirty. The next patch will dirty the pages conditionally.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As next_state() is already defined to get the next state, use it in
find_first_extent_bit. No functional changes.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
a_ops->readpages() is only ever used for read-ahead. Ensure that we
pass this information down to the block layer.
Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The data and metadata callback implementation both use the same
function. We can remove the call indirection and intermediate helper
completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The data and metadata callback implementation both use the same
function. We can remove the call indirection completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All implementations of the callback are trivial and do the same and
there's only one user. Merge everything together.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
this flag set are not in any way dummy. Rather, they are private in the
sense that are not mapped and linked to the global buffer tree. This
flag has subtle implications to the way free_extent_buffer works for
example, as well as controls whether page->mapping->private_lock is held
during extent_buffer release. Pages for an unmapped buffer cannot be
under io, nor can they be written by a 3rd party so taking the lock is
unnecessary.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ EXTENT_BUFFER_UNMAPPED, update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Remove stale comment since there is no longer an eb->eb_lock and
document the locking expectation with a lockdep_assert_held statement.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function used to release one page (and always the first one), but
not anymore since a50924e3a4 ("btrfs: drop constant param
from btrfs_release_extent_buffer_page"). Update the name and comment.
Signed-off-by: David Sterba <dsterba@suse.com>
The purpose of the function is to free all the pages comprising an
extent buffer. This can be achieved with a simple for loop rather than
the slightly more involved 'do {} while' construct. So rewrite the
loop using a 'for' construct. Additionally we can never have an
extent_buffer that has 0 pages so remove the check for index == 0. No
functional changes.
The reversed order used to have a meaning in the past where the first
page served as a blocking point for several callers. See eg
4f2de97ace ("Btrfs: set page->private to the eb").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit eb14ab8ed2 ("Btrfs: fix page->private races") fixed a genuine
race between extent buffer initialisation and btree_releasepage.
Unfortunately as the code has evolved the comments weren't changed which
made them slightly wrong and they weren't very clear in the fist place.
Fix this by (hopefully) rewording them in a more approachable manner.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current version of the page unlocking code was added in
727011e07c ("Btrfs: allow metadata blocks larger than the page size")
but even in this commit that particular flag was never used per-se. In
fact, btrfs only uses PageChecked for data pages to identify pages
which have been dirtied but don't have ORDERED bit set. For more
information see 247e743cbe ("Btrfs: Use async helpers to deal with
pages that have been improperly dirtied").
However, this doesn't apply to extent buffer pages. The important bit
here is that the pages are unlocked AFTER the extent buffer has been
properly recorded in the radix tree to avoid races with
btree_releasepage. Let's exploit this fact and simplify the page
unlocking sequence by unlocking the pages in-order and removing the
redundant PageChecked flag setting/clearing.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The loops iterating eb pages use unsigned long, that's an overkill as
we know that there are at most 16 pages (64k / 4k), and 4 by default
(with nodesize 16k).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Almost all callers pass the start and len as 2 arguments but this is not
necessary, all the information is provided by the eb. By reordering the
calls to num_extent_pages, we don't need the local variables with
start/len.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAltSwkIACgkQxWXV+ddt
WDtMBQ//UXMjHaXvFmC0SM6NczuYQR51hLYtJFIKig93XK5goVpUBTNbxO7LX/Tn
4zKoKyVhkW1884V6mRiC+G23QLbo0BQZA7DExfyJ3jylQdjZMBm+K+r19OtGQf5v
CII7oUwni03KIiXIqiFAL5dLWebVQpG5EKJbh8GLZsmg6xNcyVaUqZ/fHXajbZiv
ldEBtHBKIv7WWTJmylMBKMWnRz+jqU91fXPahoU6R5qivODrLt1o/PMuSjVNhaxe
iDldHfdOaiQmLHB/1kOGyv492oW5mSSVNDE8LjEDZ61tDNlAcUyuKUWIRBxDEDtD
6D7rlVQXJ/N7sJ6+UYmJKsRpHL+NOkyzSZ0QEU/sm1Xpm8gkhHuuofRPrVCtd3l1
ZSbwvlrdyjigVEBfM3IbToQ/K6Rc1ZGId20OAs9PCQbb+mj9IxPIncZ7pI1c4hlh
pPEjcYsp14JbCTjctFalcqTiFY5tHRQsx+GUFnDyOcdL7Mi+CoH+0Jy61Vgz9GQE
7s934cfEC0ot/f66kAL/PZzxUfC7TePqaa+sDfS5BIkJ4M6lPMxS5De5R4Z0+Nzr
DXgQAlgXmxfRjpOYMTH9D0EDdSeJaNmVHgk7hFbiYk/KX3oyd4NmgI9Cfao8rQJv
2yd8wF2httfSJKD4b/Hv9r6Ho/Bw9PK59BvWOKYhSj6IGl32utw=
=f7eB
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix of a corruption regarding fsync and clone, under some very
specific conditions explained in the patch.
The fix is marked for stable 3.16+ so I'd like to get it merged now
given the impact"
* tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix file data corruption after cloning a range and fsync
When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
later ends up calling try_release_extent_mapping() which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allow blocking and the range not being locked (which
is the case during the clone operation) nor being the extent map flagged
as pinned (also the case for cloning).
The following example, turned into a test for fstests, reproduces the
issue:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
$ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
# reflink destination offset corresponds to the size of file bar,
# 2728Kb minus 4Kb.
$ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
$ md5sum /mnt/bar
95a95813a8c2abc9aa75a6c2914a077e /mnt/bar
<power fail>
$ mount /dev/sdb /mnt
$ md5sum /mnt/bar
207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
# digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
# power failure
In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.
Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger then 16Mb or gfp flags do not allow blocking).
CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsvtAIACgkQxWXV+ddt
WDvPJQ//eEr+ACNxG5n42e63TaNpLOeoGAsUChwekP6DAIc2n/r78SHX+T/woDqd
7/dc2eqlYF5Fmn6MPQ1ufL2xlfw0t3OnemK8T5+4sxZDdzeAH9V+kHaqoaLUPExL
4r6lK5Ywwa2cWghC7WvQg3+bWLX18OEExG63SlEhLo3YJM5uUqVhGVPi6ARrbxNM
GJvXcQsxjXLqukm4gYvHC6Zra9Awv65uiAU+VCm2y96j1fEJ0yjK/pC1RtoFGCqS
48Jjuzfq/V3nxy0Wjr3DvpQVEQcKyGha6i/eazZISdRhGSjrYdvIwpUn7gZeoo2Q
hT8VVergLbVYgIeaOwgwubQzNaG2C/ZTsEjPQrNrA7a/AGsh5C/ommYE9MSyS6L0
PG0NLUNDXFmEj8WdI97So+1Sm2OCb04DPbPhHbbkhw5L/MPE1TaLN5aUWguj8laB
NnyPRdVP9jCJAI0OhJY7nPDmsPKe2jogVVsRheTcob+V5G+vIgzDXlGfsW/88Seg
dHubMaC0nz6u8Cj4dIviitiLXuustyz0JkVdljTLawWWJ/Hs7NlsSf3Q3nj2Kvia
e8QMID0vLphQyL1hqC0n7M0g2ohq9NUGT1nhLTPSpFl0l8bIA9PQehOx9Q5vx8yp
tJF+d0qiNfgvadA+KhyvQ8puQb2+zZQ8+Pqfjwd4ySD7SI5dcA4=
=fCdm
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes and an incorrect error value propagation fix from
'rename exchange'"
* tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix return value on rename exchange failure
btrfs: fix invalid-free in btrfs_extent_same
Btrfs: fix physical offset reported by fiemap for inline extents
Commit 9d311e11fc ("Btrfs: fiemap: pass correct bytenr when
fm_extent_count is zero") introduced a regression where we no longer
report 0 as the physical offset for inline extents (and other extents
with a special block_start value). This is because it always sets the
variable used to report the physical offset ("disko") as em->block_start
plus some offset, and em->block_start has the value 18446744073709551614
((u64) -2) for inline extents.
This made the btrfs test 004 (from fstests) often fail, for example, for
a file with an inline extent we have the following items in the subvolume
tree:
item 101 key (418 INODE_ITEM 0) itemoff 11029 itemsize 160
generation 25 transid 38 size 1525 nbytes 1525
block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
sequence 0 flags 0x2(none)
atime 1529342058.461891730 (2018-06-18 18:14:18)
ctime 1529342058.461891730 (2018-06-18 18:14:18)
mtime 1529342058.461891730 (2018-06-18 18:14:18)
otime 1529342055.869892885 (2018-06-18 18:14:15)
item 102 key (418 INODE_REF 264) itemoff 11016 itemsize 13
index 25 namelen 3 name: fc7
item 103 key (418 EXTENT_DATA 0) itemoff 9470 itemsize 1546
generation 38 type 0 (inline)
inline extent data size 1525 ram_bytes 1525 compression 0 (none)
Then when test 004 invoked fiemap against the file it got a non-zero
physical offset:
$ filefrag -v /mnt/p0/d4/d7/fc7
Filesystem type is: 9123683e
File size of /mnt/p0/d4/d7/fc7 is 1525 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 4095: 18446744073709551614.. 4093: 4096: last,not_aligned,inline,eof
/mnt/p0/d4/d7/fc7: 1 extent found
This resulted in the test failing like this:
btrfs/004 49s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad)
--- tests/btrfs/004.out 2016-08-23 10:17:35.027012095 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad 2018-06-18 18:15:02.385872155 +0100
@@ -1,3 +1,10 @@
QA output created by 004
*** test backref walking
-*** done
+./tests/btrfs/004: line 227: [: 7.55578637259143e+22: integer expression expected
+ERROR: 7.55578637259143e+22 is not a valid numeric value.
+unexpected output from
+ /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -s 65536 -P 7.55578637259143e+22 /home/fdmanana/btrfs-tests/scratch_1
...
(Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad' to see the entire diff)
Ran: btrfs/004
The large number in scientific notation reported as an invalid numeric
value is the result from the filter passed to perl which multiplies the
physical offset by the block size reported by fiemap.
So fix this by ensuring the physical offset is always set to 0 when we
are processing an extent with a special block_start value.
Fixes: 9d311e11fc ("Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsg598ACgkQxWXV+ddt
WDtG1w//R9/nvAk5A5cbForTXNyxwXAQnz0t2/4Lh5igbfJloqoTZtr47Gvqsvy+
DITU+3BPcyupBuUFLoeivPC+ruKOUmle27Vm62mdZRtyt96kiUiV/m1gbPhFy8lW
DKHK8tvtIoZObo5oNGvRxiuJPjefiChYZMDHokB+MY5ZALSRaIW9opj2WsM+ZAt7
g9CEjeitQcBY68CCpSEVBlQSz+BTZYEDFJGNCmGDxGhaBGZr/ganrkDJ75cG6U10
LnOZ6LDHNxGMqUm4wnhfmpHtVcIJiBF+gOyTumBPtFSoLnBverl684xizglpoq6d
fnUP8Y6XR9JA4OCZvo310yvX9nyqgb0H2h+APO0f7jRRcJo0QSKZ/qZR+XZCk3PU
91HtBopcGs8gGQUkRdAE7TMCiIEzL1eNOXHvsiILCObq1i7iNCe7Dzx6M6Gfgep8
a3IcoVmSw1DFpln2ZxTQw9viAib41iU46XHXz7W7rGPulF5QdXGo5ScORRG6HBLE
nZsXdTkrCPMJyN2bIhU6YJOK9rb9TjD4lUtnvzT8t1CfUsxQT4AsJykaYr9BwF2D
Z4rBruUAQ3OmvXJDfGG4T5YCAdPBN+xBcxCeyrDZSD0r6YPQGoF0dlvHmvP0J78D
oGkD1bb/gjcvsPJxtTQ4QWEh0oqZiDfRt4qdgO46vhba0onzQFw=
=X2zA
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- error handling fixup for one of the new ioctls from 1st pull
- fix for device-replace that incorrectly uses inode pages and can mess
up compressed extents in some cases
- fiemap fix for reporting incorrect number of extents
- vm_fault_t type conversion
* tag 'for-4.18-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: scrub: Don't use inode pages for device replace
btrfs: change return type of btrfs_page_mkwrite to vm_fault_t
Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero
btrfs: Check error of btrfs_iget in btrfs_search_path_in_tree_user
[BUG]
fm_mapped_extents is not correct when fm_extent_count is 0
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
When user space wants to get the number of file extents,
set fm_extent_count to 0 to run fiemap and then read fm_mapped_extents.
In the above example, fiemap will return with fm_mapped_extents set to 4,
but it should be 1 since there's only one entry in the output.
[REASON]
The problem seems to be that disko is only set if
fieinfo->fi_extents_max is set. And this member is initialized, in the
generic ioctl_fiemap function, to the value of used-passed
fm_extent_count. So when the user passes 0 then fi_extent_max is also
set to zero and this causes btrfs to not initialize disko at all.
Eventually this leads emit_fiemap_extent being called with a bogus
'phys' argument preventing proper fiemap entries merging.
[FIX]
Move the disko initialization earlier in extent_fiemap making it
independent of user-passed arguments, allowing emit_fiemap_extent to
properly handle consecutive extent entries.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsVV6wACgkQxWXV+ddt
WDsaOhAAnrG3o/Bkn4loifYjiWQoArswwVJ2m5MvG14oPcn/c6wCUlE6mHdEsT8l
WgltSRSd/3hJ1BcuWjhcR8RSLBRSaPcw4FAHikAdOWMrLXSKfA+quT/upXD00iDS
+bnHoknc5vvk+Ow0KyGODAeMYfJm8gGB43BeVDr/mgrVRrwYY9r14s+6LX897NOT
qObdxBjtedua0uIYZX5673siFaSbvJNQyGn6iKO9Ww+7bkWjgKO8qY8/3/gEaJn1
q1303Oo65VOXQQKZed+NjPNakpqRmABaN79qTwWMnUI4qYBKi651a7sr5HljDpmc
szY3uF/E7AOfJSDeTBdglFR1eQgv20ceeIx7FN0n6iUbuIVH9P85mBF6f76Hivpq
fA39uPYxWwWq8RxmC09yqoTkuNS0zAQhov4n8XO0q/81Gfek0RpR+LMK15W02Ur8
+5Pazkgz+xRYRPhpExe75pbM5N31KFJdf+4wpXeZoXKGTeiiCl/SZV3NzyUtemGg
eN7ltHD/Y9iNpx7sgHn+2veXwn6psDad8ORxIlRyKDeot2pNZ4lnekPbqMC/IwAz
hF3ZpcEo8v2Q2kqwCqw5evq4NSOvfG0cEuIqAKir1AgTn2nPH99qRSR1He9kluap
GBPTPXNgBrhhdsh5kltI6CJcGE46bHsLG4LI5WvRs+wOtyaG5Ww=
=jIvf
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"User visible features:
- added support for the ioctl FS_IOC_FSGETXATTR, per-inode flags,
successor of GET/SETFLAGS; now supports only existing flags:
append, immutable, noatime, nodump, sync
- 3 new unprivileged ioctls to allow users to enumerate subvolumes
- dedupe syscall implementation does not restrict the range to 16MiB,
though it still splits the whole range to 16MiB chunks
- on user demand, rmdir() is able to delete an empty subvolume,
export the capability in sysfs
- fix inode number types in tracepoints, other cleanups
- send: improved speed when dealing with a large removed directory,
measurements show decrease from 2000 minutes to 2 minutes on a
directory with 2 million entries
- pre-commit check of superblock to detect a mysterious in-memory
corruption
- log message updates
Other changes:
- orphan inode cleanup improved, does no keep long-standing
reservations that could lead up to early ENOSPC in some cases
- slight improvement of handling snapshotted NOCOW files by avoiding
some unnecessary tree searches
- avoid OOM when dealing with many unmergeable small extents at flush
time
- speedup conversion of free space tree representations from/to
bitmap/tree
- code refactoring, deletion, cleanups:
+ delayed refs
+ delayed iput
+ redundant argument removals
+ memory barrier cleanups
+ remove a redundant mutex supposedly excluding several ioctls to
run in parallel
- new tracepoints for blockgroup manipulation
- more sanity checks of compressed headers"
* tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (183 commits)
btrfs: Add unprivileged version of ino_lookup ioctl
btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF
btrfs: Add unprivileged ioctl which returns subvolume information
Btrfs: clean up error handling in btrfs_truncate()
btrfs: Factor out write portion of btrfs_get_blocks_direct
btrfs: Factor out read portion of btrfs_get_blocks_direct
btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist
btrfs: raid56: Remove VLA usage
btrfs: return error value if create_io_em failed in cow_file_range
btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot
btrfs: drop unused parameter qgroup_reserved
btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
btrfs: lift some btrfs_cross_ref_exist checks in nocow path
btrfs: Remove fs_info argument from btrfs_uuid_tree_rem
btrfs: Remove fs_info argument from btrfs_uuid_tree_add
Btrfs: remove unused check of skip_locking
Btrfs: remove always true check in unlock_up
Btrfs: grab write lock directly if write_lock_level is the max level
Btrfs: move get root out of btrfs_search_slot to a helper
Btrfs: use more straightforward extent_buffer_uptodate check
...
Convert btrfs to embedded bio sets.
Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It can be directly referenced from the passed address_space so do that.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is called only from btrfs_readpage and is already passed
the mapping. Simplify its signature by moving the code obtaining
reference to the extent tree in the function. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used in the function so just remove it. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already gets the page from which the two extent trees
are referenced. Simplify its signature by moving the code getting the
trees inside the function. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
le_bitmap_set() is only used by free-space-tree, so move it there and
make it static. le_bitmap_clear() is not used, so remove it.
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrTQ4sACgkQxWXV+ddt
WDti3Q/+MAeqsLTjvre2RQ3ka5hNyCuVftUIBmcP3YfJbt+xZYQyaewW4Xkfi3cm
cbJE+zehzf5ag+RJhxk3OwFTvNfLGIO9asWs3b08NGUi6VzwL0/8B/iOdZPuHSAV
TrecQIBE2Tp+xax9cQEnxav34D4dUtXNaDweGjp1MIIUkDneQP/I0vlTu7vafBgX
UVxP6riL/MCs7sjTHGIPs0lv8L/fgdmo+dk5SnNuIPTOcFTQXgVrtHjw9IvbKWd4
aq+sbNWoSrhXUfllbFg/wZqDe9tWn9E2f6m/H0ThSoNdxusSVgacOjFRYh20NKLW
WGB8Amd/ItGtJwJ1CIypa7VX2U11UAi0XT7BeiK82rUNEJ6moRqFOXG861gRLoTZ
SpH8uWO+e+CogfXob1KCndn5lot4AM2ZTkCqfrjpM35Nul72PZdne0CxNlmiRupY
Fdt5GB+sg8plcMaRiYr++BbbHP5tggX1MrhLGEbx2XBs2eRdn+2Lv2I1Ig/U4NUb
Vf+xk/tFLKGOTSZlbv7SV7ekXxG/3+7gAuL7A1XMETZCwBF4L3hwyW7CgkEBb3PC
TqX8BwMaRpyp/FgW/QL6edjXZ3a64VaHIfqRPNks3lWWcCHVbzyiVPfEjx+AEJ/0
abx6jXTJhLUBPuPxEgb5rsSv62RoxPoYHqCErrG95XwnZ8rSCts=
=I0y0
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"We have queued a few more fixes (error handling, log replay,
softlockup) and the rest is SPDX updates that touche almost all files
so the diffstat is long"
* tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Only check first key for committed tree blocks
btrfs: add SPDX header to Kconfig
btrfs: replace GPL boilerplate by SPDX -- sources
btrfs: replace GPL boilerplate by SPDX -- headers
Btrfs: fix loss of prealloc extents past i_size after fsync log replay
Btrfs: clean up resources during umount after trans is aborted
btrfs: Fix possible softlock on single core machines
Btrfs: bail out on error during replay_dir_deletes
Btrfs: fix NULL pointer dereference in log_dir_items
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.
[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The missing error handling in add_extent_changeset was hidden, so make
it at least visible in the callers.
Signed-off-by: David Sterba <dsterba@suse.com>
The merge call was factored out to a separate helper but it's a trivial
one and arguably we can opencode it and cache the value.
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass a valid pointer so we can drop the redundant checks.
The call to submit_one_bio never happend and can be removed.
Signed-off-by: David Sterba <dsterba@suse.com>
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.
Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:
- printf wrappers, error messages
- exit helpers
Signed-off-by: David Sterba <dsterba@suse.com>
extent_buffer_uptodate() is a trivial wrapper around test_bit() and
nothing else. So make it static and inline, save on code space and call
indirection.
Before:
text data bss dec hex filename
1131257 82898 18992 1233147 12d0fb fs/btrfs/btrfs.ko
After:
text data bss dec hex filename
1131090 82898 18992 1232980 12d054 fs/btrfs/btrfs.ko
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This prints out eb->bflags since it contains some useful information,
e.g. whether eb is dirty.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpvikQACgkQxWXV+ddt
WDs6qA//ZE7eEH0sKpD4Z+3gUevk/MMXwE9prRijEdjXz/K/UXtvpq0sI7HMQskZ
Ls9Wmzof+3WEQoa08RQZFzwuclW1Udm09SqE2oHP2gXQB5rC0BtWdrlMaKUJy03y
NUwxHetbE6TsFLU5HIVmi05NexNx9SVV6oJTWt00RlXTePw9Aoc88ikoXXUE2vqH
wbH9/ccmM9EkDFxdG+YG5QX054kQV8/5RXdqBJnIiGVRX5ZsAY84AN9x9YoRCVUw
wq9TfPu6XmeA6Uq6knpeLlXDms5w+FE3n5CduROk7Q7YNgpoZekF20c8uK8HzT4T
KF8hc0QpQgRCVBJ8I4MbPSMRIDf3IWfZmWSDEDda/6/ep6Bl99b8PFvdDKDBMUct
8wsgGrwGbHuz2l2QUIXjpBL9Cv9Tbu8vjmg0h2hFrpiH1c8JaXjKtJXAMtigWsZ1
DdX+5Y0zqvV/YLpzKF4aMDWXIteN4qaznvjdmj3B7BxgcnITOV/cmPCyMplNrtUa
Cs2fzGV5tpxhBzxE490v+frMULmLq2W1e6WPfFCqPKBCqulcR75TozDQS9M2/h4k
uAZzVKoguHrUPP1ONVas9aVC05K473nbYPd28eecYwkZ4z32hK4SAdnQY1aPreTe
axoV7p7a+i1bkzT5LK6gIfpddVWth8w45nz4P0lwxp0Z6XlkbJE=
=Irul
-----END PGP SIGNATURE-----
Merge tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features or user visible changes:
- fallocate: implement zero range mode
- avoid losing data raid profile when deleting a device
- tree item checker: more checks for directory items and xattrs
Notable fixes:
- raid56 recovery: don't use cached stripes, that could be
potentially changed and a later RMW or recovery would lead to
corruptions or failures
- let raid56 try harder to rebuild damaged data, reading from all
stripes if necessary
- fix scrub to repair raid56 in a similar way as in the case above
Other:
- cleanups: device freeing, removed some call indirections, redundant
bio_put/_get, unused parameters, refactorings and renames
- RCU list traversal fixups
- simplify mount callchain, remove recursing back when mounting a
subvolume
- plug for fsync, may improve bio merging on multiple devices
- compression heurisic: replace heap sort with radix sort, gains some
performance
- add extent map selftests, buffered write vs dio"
* tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits)
btrfs: drop devid as device_list_add() arg
btrfs: get device pointer from device_list_add()
btrfs: set the total_devices in device_list_add()
btrfs: move pr_info into device_list_add
btrfs: make btrfs_free_stale_devices() to match the path
btrfs: rename btrfs_free_stale_devices() arg to skip_dev
btrfs: make btrfs_free_stale_devices() argument optional
btrfs: make btrfs_free_stale_device() to iterate all stales
btrfs: no need to check for btrfs_fs_devices::seeding
btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
Btrfs: noinline merge_extent_mapping
Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
Btrfs: extent map selftest: dio write vs dio read
Btrfs: extent map selftest: buffered write vs dio read
Btrfs: add extent map selftests
Btrfs: move extent map specific code to extent_map.c
Btrfs: add helper for em merge logic
Btrfs: fix unexpected EEXIST from btrfs_get_extent
Btrfs: fix incorrect block_len in merge_extent_mapping
btrfs: Remove unused readahead spinlock
...
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one instance where we pass different gfp mask to
unlock_extent_cached. Add a separate helper for that and then we can
drop the gfp parameter from unlock_extent_cached.
Signed-off-by: David Sterba <dsterba@suse.com>
The bio is never referenced after it has been submitted so there is no
point in getting an extra reference.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions called from extent_write_cache_pages used void* as generic
callback data, but all of them convert it to extent_page_data, or use it
directly.
Signed-off-by: David Sterba <dsterba@suse.com>
The function extent_write_cache_pages is modelled after
write_cache_pages which is a generic interface and the writepage
parameter makes sense there. In btrfs we know exactly which callback
we're going to use, so we can pass it directly.
Signed-off-by: David Sterba <dsterba@suse.com>
flush_epd_write_bio is same as flush_write_bio, no point having two such
functions. Merge them to flush_write_bio. The 'noinline' attribute is
removed as it does not have any meaning.
Signed-off-by: David Sterba <dsterba@suse.com>
The tree argument passed to extent_write_full_page is referenced from
the page being passed to the same function. Since we already have
enough information to get the reference, remove the function parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is called only from submit_compressed_extents and the
io tree being passed is always that of the inode. But we are also
passing the inode, so just move getting the io tree pointer in
extent_write_locked_range to simplify the signature.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently device state is being managed by each individual int
variable such as struct btrfs_device::writeable. Instead of that
declare device state BTRFS_DEV_STATE_WRITEABLE and use the
bit operations.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9036c10208 ("Btrfs: update hole handling v2") added the
FLAG_VACANCY to denote holes, however there was already a consistent way
of flagging extents which represent hole - ->block_start =
EXTENT_MAP_HOLE. And also the only place where this flag is checked is
in the fiemap code, but the block_start value is also checked and every
other place in the filesystem detects holes by using block_start
value's. So remove the extra flag. This survived a full xfstest run.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass btrfs_get_extent_fiemap and get_extent_skip_holes
itself is used only as a fiemap helper.
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass btrfs_get_extent_fiemap and we don't expect anything
else in the context of extent_fiemap.
Signed-off-by: David Sterba <dsterba@suse.com>
Previous patches cleaned up all places where
extent_page_data::get_extent was set and it was btrfs_get_extent all the
time, so we can simply call that instead.
This also reduces size of extent_page_data by 8 bytes which has positive
effect on stack consumption on various functions on the write out path.
Signed-off-by: David Sterba <dsterba@suse.com>
All callers use GFP_NOFS, we don't have to pass it as an argument. The
built-in tests pass GFP_KERNEL, but they run only at module load time
and NOFS works there as well.
Signed-off-by: David Sterba <dsterba@suse.com>
Use __clear_extent_bit directly in case we want to pass unknown
gfp flags. Otherwise all clear_extent_bit callers use GFP_NOFS, so we
can sink them to the function and reduce argument count, at the cost
that __clear_extent_bit has to be exported.
Signed-off-by: David Sterba <dsterba@suse.com>
The following callpath is always invoked with mirror_num set to 0, so
let's remove it as an argument and directly pass 0 to __do_redpage. No
functional change.
extent_readpages
__extent_readpages
__do_contiguous_readpages
__do_readpage
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS uses bio->bi_vcnt to figure out page numbers, this approach is no
longer valid once we start enabling multipage bvecs.
correct once we start to enable multipage bvec.
Use bio_nr_pages() to do that instead.
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlofBpkACgkQxWXV+ddt
WDvtTQ//emI1QsD4N0e4BxMcZ1bcigiEk3jc4gj+biRapnMHHAHOqJbVtpK1v8gS
PCTw+4uD5UOGvhBtS4kXJn8e2qxWCESWJDXwVlW0RHmWLfwd9z7ly0sBMi3oiIqH
qief8CIkk3oTexiuuJ3mZGxqnDQjRGtWx2LM+bRJBWMk+jN32v2ObSlv9V505a5M
1daDBsjWojFWa8d4r3YZNJq1df2om/dwVQZ0Wk59bacIo9Xbvok0X459cOlWuv0p
mjx8m8uA/z+HVdkTYlzyKpq08O8Z4shj3GrBbSnZ511gKzV+c9jJPxij5pKm3Z2z
KW4Mp17+/7GSNcSsJiqnOYi+wtOrak2lD0COlZTijnY2jrv18h8ianoIM6CpzUdy
+b09yuFXbPLoUfyl6vFaO/JHuvAkQdaR2tJbds6lvW+liC1ReoL4W1WcUjY6nv9f
6wTaIv0vwgrHaxeIzxKNpnsTlpHAgorFFk0/w8nLb40WX8AoJ/95lo2zws8oaFDN
0Fylu3NYhoDrJZK+D8dbsWx2eTsFVCqep4w0+iEVZl3lfuy3FZl1pu8CL7ru9vJl
DNieh+lUvK1Fk+SYIoilGoriW96RbU8+jPo2W4A1ENzeMJfrNCSWtUSZZp4XT4tO
8m1PGud07XBLSxd62bAEDV3KZO2DnY1WxgXbKuIHSi9D5CI1LMo=
=7UW+
-----END PGP SIGNATURE-----
Merge tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We've collected some fixes in since the pre-merge window freeze.
There's technically only one regression fix for 4.15, but the rest
seems important and candidates for stable.
- fix missing flush bio puts in error cases (is serious, but rarely
happens)
- fix reporting stat::st_blocks for buffered append writes
- fix space cache invalidation
- fix out of bound memory access when setting zlib level
- fix potential memory corruption when fsync fails in the middle
- fix crash in integrity checker
- incremetnal send fix, path mixup for certain unlink/rename
combination
- pass flags to writeback so compressed writes can be throttled
properly
- error handling fixes"
* tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: incremental send, fix wrong unlink path after renaming file
btrfs: tree-checker: Fix false panic for sanity test
Btrfs: fix list_add corruption and soft lockups in fsync
btrfs: Fix wild memory access in compression level parser
btrfs: fix deadlock when writing out space cache
btrfs: clear space cache inode generation always
Btrfs: fix reported number of inode blocks after buffered append writes
Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
Btrfs: bail out gracefully rather than BUG_ON
btrfs: dev_alloc_list is not protected by RCU, use normal list_del
btrfs: add missing device::flush_bio puts
btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
Btrfs: add write_flags for compression bio
This is a pure automated search-and-replace of the internal kernel
superblock flags.
The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.
Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.
The script to do this was:
# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
for f in $L; do sed -i $f $SED_PROG; done
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot. As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.
No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.
Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages. Just drop the argument.
Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compression code path has only flaged bios with REQ_OP_WRITE no matter
where the bios come from, but it could be a sync write if fsync starts
this writeback or a normal writeback write if wb kthread starts a
periodic writeback.
It breaks the rule that sync writes and writeback writes need to be
differentiated from each other, because from the POV of block layer,
all bios need to be recognized by these flags in order to do some
management, e.g. throttlling.
This passes writeback_control to compression write path so that it can
send bios with proper flags to block layer.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from David Sterba:
"There are some new user features and the usual load of invisible
enhancements or cleanups.
New features:
- extend mount options to specify zlib compression level, -o
compress=zlib:9
- v2 of ioctl "extent to inode mapping", addressing a usecase where
we want to retrieve more but inaccurate results and do the
postprocessing in userspace, aiding defragmentation or
deduplication tools
- populate compression heuristics logic, do data sampling and try to
guess compressibility by: looking for repeated patterns, counting
unique byte values and distribution, calculating Shannon entropy;
this will need more benchmarking and possibly fine tuning, but the
base should be good enough
- enable indexing for btrfs as lower filesystem in overlayfs
- speedup page cache readahead during send on large files
Internal enhancements:
- more sanity checks of b-tree items when reading them from disk
- more EINVAL/EUCLEAN fixups, missing BLK_STS_* conversion, other
errno or error handling fixes
- remove some homegrown IO-related logic, that's been obsoleted by
core block layer changes (batching, plug/unplug, own counters)
- add ref-verify, optional debugging feature to verify extent
reference accounting
- simplify code handling outstanding extents, make it more clear
where and how the accounting is done
- make delalloc reservations per-inode, simplify the code and make
the logic more straightforward
- extensive cleanup of delayed refs code
Notable fixes:
- fix send ioctl on 32bit with 64bit kernel"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (102 commits)
btrfs: Fix bug for misused dev_t when lookup in dev state hash table.
Btrfs: heuristic: add Shannon entropy calculation
Btrfs: heuristic: add byte core set calculation
Btrfs: heuristic: add byte set calculation
Btrfs: heuristic: add detection of repeated data patterns
Btrfs: heuristic: implement sampling logic
Btrfs: heuristic: add bucket and sample counters and other defines
Btrfs: compression: separate heuristic/compression workspaces
btrfs: move btrfs_truncate_block out of trans handle
btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
btrfs: track refs in a rb_tree instead of a list
btrfs: add a comp_refs() helper
btrfs: switch args for comp_*_refs
btrfs: make the delalloc block rsv per inode
btrfs: add tracepoints for outstanding extents mods
Btrfs: rework outstanding_extents
btrfs: increase output size for LOGICAL_INO_V2 ioctl
btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
btrfs: send: remove unused code
...
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The use of sector_t in the callchain of submit_extent_page is not
necessary. Switch to u64 and rename the variable and use byte units
instead of 512b, ie. dropping the >> 9 shifts and avoiding the
con(tro)versions of sector_t.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to remove sector_t and will use 'offset', so this patch
frees the name.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.
We can remove this bio_flags because in order to write dirty blocks,
log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers, therefore, every write goes in a
synchronous way, so does checksuming.
Please also note that, bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, ie.
1) while log tree is flushing its dirty blocks via
btrfs_write_marked_extents(), in which %sync_writers is increased
by one.
2) in the meantime, some writeback operations may happen upon btrfs's
metadata inode, so these writes go synchronously, too.
However, AFAICS, the overhead is not a big one while the win is that
we unify the two places that needs synchronous way and remove a
special hack/flag.
This removes the bio_flags related stuff for writing log-tree.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from David Sterba:
"Two more fixes for bugs introduced in 4.13.
The sector_t problem with 32bit architecture and !LBDAF config seems
serious but the number of affected deployments is hopefully low.
The clashing status bits could lead to a confusing in-memory state of
the whole-filesystem operations if used with the quota override sysfs
knob"
* 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix overlap of fs_info::flags values
btrfs: avoid overflow when sector_t is 32 bit
Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.
In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).
I added a cast to u64 to avoid overflow.
Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from David Sterba:
"We've collected a bunch of isolated fixes, for crashes, user-visible
behaviour or missing bits from other subsystem cleanups from the past.
The overall number is not small but I was not able to make it
significantly smaller. Most of the patches are supposed to go to
stable"
* 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: log csums for all modified extents
Btrfs: fix unexpected result when dio reading corrupted blocks
btrfs: Report error on removing qgroup if del_qgroup_item fails
Btrfs: skip checksum when reading compressed data if some IO have failed
Btrfs: fix kernel oops while reading compressed data
Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
Btrfs: do not backup tree roots when fsync
btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
btrfs: propagate error to btrfs_cmp_data_prepare caller
btrfs: prevent to set invalid default subvolid
Btrfs: send: fix error number for unknown inode types
btrfs: fix NULL pointer dereference from free_reloc_roots()
btrfs: finish ordered extent cleaning if no progress is found
btrfs: clear ordered flag on cleaning up ordered extents
Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
Btrfs: do not reset bio->bi_ops while writing bio
Btrfs: use the new helper wbc_to_write_flags
flush_epd_write_bio() sets bio->bi_opf by itself to honor REQ_SYNC,
but it's not needed at all since bio->bi_opf has set up properly in
both __extent_writepage() and write_one_eb(), and in the case of
write_one_eb(), it also sets REQ_META, which we will lose in
flush_epd_write_bio().
This remove this unnecessary bio->bi_opf setting.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This updates btrfs to use the helper wbc_to_write_flags which has been
applied in ext4/xfs/f2fs/block.
Please note that, with this, btrfs's dirty pages written by a
writeback job will carry the flag REQ_BACKGROUND, which is currently
used by writeback-throttle to determine whether it should go to get a
request or wait.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull mount flag updates from Al Viro:
"Another chunk of fmount preparations from dhowells; only trivial
conflicts for that part. It separates MS_... bits (very grotty
mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
only a small subset of MS_... stuff).
This does *not* convert the filesystems to new constants; only the
infrastructure is done here. The next step in that series is where the
conflicts would be; that's the conversion of filesystems. It's purely
mechanical and it's better done after the merge, so if you could run
something like
list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')
sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
-e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
-e 's/\<MS_NODEV\>/SB_NODEV/g' \
-e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
-e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
-e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
-e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
-e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
-e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
-e 's/\<MS_SILENT\>/SB_SILENT/g' \
-e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
-e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
-e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
-e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
$list
and commit it with something along the lines of 'convert filesystems
away from use of MS_... constants' as commit message, it would save a
quite a bit of headache next cycle"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Differentiate mount flags (MS_*) from internal superblock flags
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
Pull btrfs updates from David Sterba:
"The changes range through all types: cleanups, core chagnes, sanity
checks, fixes, other user visible changes, detailed list below:
- deprecated: user transaction ioctl
- mount option ssd does not change allocation alignments
- degraded read-write mount is allowed if all the raid profile
constraints are met, now based on more accurate check
- defrag: do not reset compression afterwards; the NOCOMPRESS flag
can be now overriden by defrag
- prep work for better extent reference tracking (related to the
qgroup slowness with balance)
- prep work for compression heuristics
- memory allocation reductions (may help latencies on a loaded
system)
- better accounting for io waiting states
- error handling improvements (removed BUGs)
- added more sanity checks for shared refs
- fix readdir vs pagefault deadlock under some circumstances
- fix for 'no-hole' mode, certain combination of compressed and
inline extents
- send: fix emission of invalid clone operations
- fixup file mode if setting acls fail
- more fixes from fuzzing
- oher cleanups"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
btrfs: submit superblock io with REQ_META and REQ_PRIO
btrfs: remove unnecessary memory barrier in btrfs_direct_IO
btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
btrfs: pass fs_info to btrfs_del_root instead of tree_root
Btrfs: add one more sanity check for shared ref type
Btrfs: remove BUG_ON in __add_tree_block
Btrfs: remove BUG() in add_data_reference
Btrfs: remove BUG() in print_extent_item
Btrfs: remove BUG() in btrfs_extent_inline_ref_size
Btrfs: convert to use btrfs_get_extent_inline_ref_type
Btrfs: add a helper to retrive extent inline ref type
btrfs: scrub: simplify scrub worker initialization
btrfs: scrub: clean up division in scrub_find_csum
btrfs: scrub: clean up division in __scrub_mark_bitmap
btrfs: scrub: use bool for flush_all_writes
btrfs: preserve i_mode if __btrfs_set_acl() fails
btrfs: Remove extraneous chunk_objectid variable
btrfs: Remove chunk_objectid argument from btrfs_make_block_group
btrfs: Remove extra parentheses from condition in copy_items()
...
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a corner case that slips through the checkers in functions
reading extent buffer, ie.
if (start < eb->len) and (start + len > eb->len),
then
a) map_private_extent_buffer() returns immediately because
it's thinking the range spans across two pages,
b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
and WARN_ON(start + len > eb->start + eb->len), both are OK in this
corner case, but it'd actually try to access the eb->pages out of
bounds because of (start + len > eb->len).
The case is found by switching extent inline ref type from shared data
ref to non-shared data ref, which is a kind of metadata corruption.
It'd use the wrong helper to access the eb,
eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing
here is "struct btrfs_shared_data_ref". And if the extent item
happens to be the first item in the eb, then offset/length will get
over eb->len which ends up an invalid memory access.
This is adding proper checks in order to avoid invalid memory access,
ie. 'general protection fault', before it's too late.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function submit_extent_page has 15(!) parameters right now, op and
op_flags are effectively one value stored to bio::bi_opf, no need to
pass them separately. So it's 14 parameters now.
Signed-off-by: David Sterba <dsterba@suse.com>
This also adjusts the respective callers in other files. Those were
found with -Wunused-parameter.
btrfs_full_stripe_len's mapping_tree - introduced by 53b381b3ab
("Btrfs: RAID5 and RAID6") but it was never really used even in that
commit
btrfs_is_parity_mirror's mirror_num - same as above
chunk_drange_filter's chunk_offset - introduced by 94e60d5a5c ("Btrfs:
devid subset filter") and never used.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit afce772e87 ("btrfs: fix check_shared for fiemap ioctl") added
transaction semantics around calls to btrfs_check_shared() in order to
provide accurate accounting of delayed refs. The transaction management
should be done inside btrfs_check_shared(), so that callers do not need
to manage transactions individually.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have reader helpers for most of the on-disk structures that use
an extent_buffer and pointer as offset into the buffer that are
read-only. We should mark them as const and, in turn, allow consumers
of these interfaces to mark the buffers const as well.
No impact on code, but serves as documentation that a buffer is intended
not to be modified.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Firstly by applying the following with coccinelle's spatch:
@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)
to effect the conversion to sb_rdonly(sb), then by applying:
@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)
@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)
to remove left over excess bracketage and finally by applying:
@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)
to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull btrfs fixes from David Sterba:
"We've identified and fixed a silent corruption (introduced by code in
the first pull), a fixup after the blk_status_t merge and two fixes to
incremental send that Filipe has been hunting for some time"
* 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix unexpected return value of bio_readpage_error
btrfs: btrfs_create_repair_bio never fails, skip error handling
btrfs: cloned bios must not be iterated by bio_for_each_segment_all
Btrfs: fix write corruption due to bio cloning on raid5/6
Btrfs: incremental send, fix invalid memory access
Btrfs: incremental send, fix invalid path for link commands
With blk_status_t conversion (that are now present in master),
bio_readpage_error() may return 1 as now ->submit_bio_hook() may not set
%ret if it runs without problems.
This fixes that unexpected return value by changing
btrfs_check_repairable() to return a bool instead of updating %ret, and
patch is applicable to both codebases with and without blk_status_t.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As the function uses the non-failing bio allocation, we can remove error
handling from the callers as well.
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've started using cloned bios more in 4.13, there are some specifics
regarding the iteration. Filipe found [1] that the raid56 iterated a
cloned bio using bio_for_each_segment_all, which is incorrect. The
cloned bios have wrong bi_vcnt and this could lead to silent
corruptions. This patch adds assertions to all remaining
bio_for_each_segment_all cases.
[1] https://patchwork.kernel.org/patch/9838535/
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull percpu updates from Tejun Heo:
"These are the percpu changes for the v4.13-rc1 merge window. There are
a couple visibility related changes - tracepoints and allocator stats
through debugfs, along with __ro_after_init markings and a cosmetic
rename in percpu_counter.
Please note that the simple O(#elements_in_the_chunk) area allocator
used by percpu allocator is again showing scalability issues,
primarily with bpf allocating and freeing large number of counters.
Dennis is working on the replacement allocator and the percpu
allocator will be seeing increased churns in the coming cycles"
* 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: fix static checker warnings in pcpu_destroy_chunk
percpu: fix early calls for spinlock in pcpu_stats
percpu: resolve err may not be initialized in pcpu_alloc
percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
percpu: add tracepoint support for percpu memory
percpu: expose statistics about percpu memory via debugfs
percpu: migrate percpu data structures to internal header
percpu: add missing lockdep_assert_held to func pcpu_free_area
mark most percpu globals as __ro_after_init
Pull btrfs updates from David Sterba:
"The core updates improve error handling (mostly related to bios), with
the usual incremental work on the GFP_NOFS (mis)use removal,
refactoring or cleanups. Except the two top patches, all have been in
for-next for an extensive amount of time.
User visible changes:
- statx support
- quota override tunable
- improved compression thresholds
- obsoleted mount option alloc_start
Core updates:
- bio-related updates:
- faster bio cloning
- no allocation failures
- preallocated flush bios
- more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates
- prep work for btree_inode removal
- dir-item validation
- qgoup fixes and updates
- cleanups:
- removed unused struct members, unused code, refactoring
- argument refactoring (fs_info/root, caller -> callee sink)
- SEARCH_TREE ioctl docs"
* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
btrfs: Remove false alert when fiemap range is smaller than on-disk extent
btrfs: Don't clear SGID when inheriting ACLs
btrfs: fix integer overflow in calc_reclaim_items_nr
btrfs: scrub: fix target device intialization while setting up scrub context
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
btrfs: qgroup: Add quick exit for non-fs extents
Btrfs: rework delayed ref total_bytes_pinned accounting
Btrfs: return old and new total ref mods when adding delayed refs
Btrfs: always account pinned bytes when dropping a tree block ref
Btrfs: update total_bytes_pinned when pinning down extents
Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
btrfs: fix validation of XATTR_ITEM dir items
btrfs: Verify dir_item in iterate_object_props
btrfs: Check name_len before in btrfs_del_root_ref
btrfs: Check name_len before reading btrfs_get_name
...
Commit 4751832da9 ("btrfs: fiemap: Cache and merge fiemap extent before
submit it to user") introduced a warning to catch unemitted cached
fiemap extent.
However such warning doesn't take the following case into consideration:
0 4K 8K
|<---- fiemap range --->|
|<----------- On-disk extent ------------------>|
In this case, the whole 0~8K is cached, and since it's larger than
fiemap range, it break the fiemap extent emit loop.
This leaves the fiemap extent cached but not emitted, and caught by the
final fiemap extent sanity check, causing kernel warning.
This patch removes the kernel warning and renames the sanity check to
emit_last_fiemap_cache() since it's possible and valid to have cached
fiemap extent.
Reported-by: David Sterba <dsterba@suse.cz>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Fixes: 4751832da9 ("btrfs: fiemap: Cache and merge fiemap extent ...")
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt safe due to explicit calls to preempt_disable. Given
how __ prefix is used in percpu related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add. In terms of context-safety,
they're equivalent. The only difference is that the __ version takes
a batch parameter.
Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.
This patch doesn't cause any functional changes.
tj: Minor updates to patch description for clarity. Cosmetic
indentation updates.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more imporatant,
contexts and the bio allocating API now looks more consistent.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use btrfs_bioset for bios and ask to allocate the entire size of
btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is
initialized but the bytes from 0 to offset of 'bio' are left
uninitialized. Although we initialize some of the members in our
helpers, we should initialize the whole structures.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that
in the helper and simplify the logic in the callsers.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.
submit_extent_page adds __GFP_HIGH that does not make a difference in
our case as it allows access to memory reserves but otherwise does not
change the constraints.
Signed-off-by: David Sterba <dsterba@suse.com>
Update direct callers of btrfs_io_bio_alloc that do error handling, that
we can now remove.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update direct callers of btrfs_bio_alloc that do error handling, that we
can now remove.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph pointed out that bio allocations backed by a bioset will never
fail. As we always use a bioset for all bio allocations, we can skip
the error handling. This patch adjusts our low-level helpers, the
cascaded changes to all callers will come next.
CC: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 5f39d397df ("Btrfs: Create extent_buffer interface
for large blocksizes") refactored btrfs_leaf_data function to take
extent_buffer rather than struct btrfs_leaf. However, as it turns out the
parameter being passed is never used. Furthermore this function no longer
returns the leaf data but rather the offset to it. So rename the function
to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_*
helpers and turn it into a macro.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ removed () from the macro ]
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass GFP_NOFS to btrfs_bio_clone_partial, so lets hardcode it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.
Also for dio reads which don't get split, we also need to save a copy of
bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access
each bvec in dio read's endio. Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds a new helper btrfs_bio_clone_partial, it'll allocate a cloned
bio that only owns a part of the original bio's data.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For raid1 and raid10, we clone the original bio to the bios which are then
sent to different disks.
Right now we use bio_clone_bioset to create a clone bio with iterating
bi_io_vec to initialize it. This changes it to use bio_clone_fast()
which creates a clone bio but only copies the bi_io_vec pointer
instead of iterating bi_io_vec.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead pass around the failure tree and the io tree.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Once we remove the btree_inode we won't have an inode to pass anymore,
just pass the fs_info directly and the inum since we use that to print
out the repair message.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks. This works fine when everything that has
an extent_io_tree has an inode. But we are going to remove the
btree_inode, so we need to change this. Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead. This had a lot of cascading changes but
should be relatively straightforward.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().
To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().
Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
=m2Gx
-----END PGP SIGNATURE-----
Merge tag 'v4.12-rc5' into for-4.13/block
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.
Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The setting of return code ret should be based on the error code
passed into function end_extent_writepage and not on ret. Thanks
to Liu Bo for spotting this mistake in the original fix I submitted.
Detected by CoverityScan, CID#1414312 ("Logically dead code")
Fixes: 5dca6eea91 ("Btrfs: mark mapping with error flag to report errors to userspace")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
# umount /mnt/btrfs
# mount /dev/vdb5 /mnt/btrfs
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..31]: 25088..25119 32 0x0
1: [32..63]: 25120..25151 32 0x0
2: [64..95]: 25152..25183 32 0x0
3: [96..127]: 25184..25215 32 0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
[REASON]
Btrfs will try to merge extent map when inserting new extent map.
btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=0 len=64k)
| | Found on-disk (ino, EXTENT_DATA, 0)
| |- add_extent_mapping()
| |- Return (em->start=0, len=16k)
|
|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
|
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=16k len=48k)
| | Found on-disk (ino, EXTENT_DATA, 16k)
| |- add_extent_mapping()
| | |- try_merge_map()
| | Merge with previous em start=0 len=16k
| | resulting em start=0 len=32k
| |- Return (em->start=0, len=32K) << Merged result
|- Stripe off the unrelated range (0~16K) of return em
|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
^^^ Causing split fiemap extent.
And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.
[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.
And will always try to merge current fiemap_cache result before calling
fiemap_fill_next_extent().
Only when we failed to merge current fiemap extent with cached one, we
will call fiemap_fill_next_extent() to submit cached one.
So by this method, we can merge all fiemap extents.
It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>