Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.
So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The @found is always false when it comes to the if branch. Besides, the
bool type is more suitable for @found. Change the return value of the
function and its caller to bool as well.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropraite name
- btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
used to just remove it. Thirdly, pages are submitted to either btree or
data inodes so it's guaranteed that tree->ops is set so replace the
check with an ASSERT. Finally, document the parameters of the function.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In map_private_extent_buffer() the 'offset' variable is initialized to a
page aligned version of the 'start' parameter.
But later on it is overwritten with either the offset from the extent
buffer's start or 0.
So get rid of the initial initialization.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
For data inodes this hook does nothing but to return -EAGAIN which is
used to signal to the endio routines that this bio belongs to a data
inode. If this is the case the actual retrying is handled by
bio_readpage_error. Alternatively, if this bio belongs to the btree
inode then btree_io_failed_hook just does some cleanup and doesn't retry
anything.
This patch simplifies the code flow by eliminating
readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
end_bio_extent_readpage. Also eliminate some needless checks since IO is
always performed on either data inode or btree inode, both of which are
guaranteed to have their extent_io_tree::ops set.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The end_io callback implemented as btrfs_io_bio_endio_readpage only
calls kfree. Also the callback is set only in case the csum buffer is
allocated and not pointing to the inline buffer. We can use that
information to drop the indirection and call a helper that will free the
csums only in the right case.
This shrinks struct btrfs_io_bio by 8 bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Several functions in BTRFS are only used inside the source file they are
declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if
CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared
with the unit tests code.
Before the introduction of the EXPORT_FOR_TESTS macro, these functions
could not be declared as static and the compiler had a harder task when
optimizing and inlining them.
As we have EXPORT_FOR_TESTS now, use it where appropriate to support the
compiler.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
lock_delalloc_pages should only return 2 values - 0 in case of success
and -EAGAIN if the range of pages to be locked should be shrunk due to
some of gone. Manual inspections confirms that this is indeed the case
since __process_pages_contig is where lock_delalloc_pages gets its
return value. The latter always returns 0 or -EAGAIN so the invariant
holds. No functional changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's
reduce the argument count and make that a local variable. No functional
changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for
snapshot flusher as we do in the generic write_cache_pages(), so we can
omit pages dirtied after the snapshot command.
This does not change the semantics regarding which data get to the
snapshot, if there are pages being dirtied during the snapshotting
operation. There's a sync called before snapshot is taken in old/new
case, any IO in flight just after that may be in the snapshot but this
depends on other system effects that might still sync the IO.
We do a simple snapshot speed test on a Intel D-1531 box:
fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
original: 1m58sec
patched: 6.54sec
This is the best case for this patch since for a sequential write case,
we omit nearly all pages dirtied after the snapshot command.
For a multi writers, random write test:
fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
original: 15.83sec
patched: 10.35sec
The improvement is smaller compared to the sequential write case,
since we omit only half of the pages dirtied after snapshot command.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This parameter was never used, yet was part of the interface of the
function ever since its introduction as extent_io_ops::writepage_end_io_hook
in e6dcd2dc9c ("Btrfs: New data=ordered implementation"). Now that
NULL is passed everywhere as a value for this parameter let's remove it
for good. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only remaining use of the 'epd' argument in writepage_delalloc is
to reference the extent_io_tree which was set in extent_writepages. Since
it is guaranteed that page->mapping of any page passed to
writepage_delalloc (and __extent_writepage as the sole caller) to be
equal to that passed in extent_writepages we can directly get the
io_tree via the already passed inode (which is also taken from
page->mapping->host). No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If epd::extent_locked is set then writepage_delalloc terminates. Make
this a bit more apparent in the caller by simply bubbling the check up.
This enables to remove epd as an argument to writepage_delalloc in a
future patch. No functional change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The loop construct in free_extent_buffer was added in
242e18c7c1 ("Btrfs: reduce lock contention on extent buffer locks")
as means of reducing the times the eb lock is taken, the non-last ref
count is decremented and lock is released. As the special handling
of UNMAPPED extent buffers was removed now there is only one decrement
op which is happening for EXTENT_BUFFER_UNMAPPED case.
This commit modifies the loop condition so that in case of UNMAPPED
buffers the eb's lock is taken only if we are 100% sure the eb is going
to be freed by the current executor of the code. Additionally, remove
superfluous ref count ops in btrfs test.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the whole of btrfs code has been audited for eb reference count
management it's time to remove the hunk in free_extent_buffer that
essentially considered the condition
"eb->ref == 2 && EXTENT_BUFFER_DUMMY"
to equal "eb->ref = 1". Also remove the last location
which takes an extra reference count in alloc_test_extent_buffer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the counterpart to merge_extent_hook, similarly, it's used only
for data/freespace inodes so let's remove it, rename it and call it
directly where necessary. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is used only for data and free space inodes. Such inodes
are guaranteed to have their extent_io_tree::private_data set to the
inode struct. Exploit this fact to directly call the function. Also give
it a more descriptive name. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent),
similar to what was done before remove clear_bit_hook and rename the
function. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is used to properly account delalloc extents for data
inodes (ordinary file inodes and freespace v1 inodes). Those can be
easily identified since they have their extent_io trees ->private_data
member point to the inode. Let's exploit this fact to remove the
needless indirection through extent_io_hooks and directly call the
function. Also give the function a name which reflects its purpose -
btrfs_set_delalloc_extent.
This patch also modified test_find_delalloc so that the extent_io_tree
used for testing doesn't have its ->private_data set which would have
caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root
member not being initialised. The old version of the code also didn't
call set_bit_hook since the extent_io ops weren't set for the inode. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback was only used in debug builds by btrfs_leak_debug_check.
A better approach is to move its implementation in
btrfs_leak_debug_check and ensure the latter is only executed for extent
tree which have ->private_data set i.e. relate to a data node and not
the btree one. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is ony ever called for data page writeout so there is no
need to actually abstract it via extent_io_ops. Lets just export it,
remove the definition of the callback and call it directly in the
functions that invoke the callback. Also rename the function to
btrfs_writepage_endio_finish_ordered since what it really does is
account finished io in the ordered extent data structures. No
functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This hook is called only from __extent_writepage_io which is already
called only from the data page writeout path. So there is no need to
make an indirect call via extent_io_ops. This patch just removes the
callback definition, exports the callback function and calls it directly
at the only call site. Also give the function a more descriptive name.
No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This callback is called only from writepage_delalloc which in turn is
guaranteed to be called from the data page writeout path. In the end
there is no reason to have the call to this function to be indrected via
the extent_io_ops structure. This patch removes the callback definition,
exports the function and calls it directly. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_run_delalloc_range ]
Signed-off-by: David Sterba <dsterba@suse.com>
Pull XArray conversion from Matthew Wilcox:
"The XArray provides an improved interface to the radix tree data
structure, providing locking as part of the API, specifying GFP flags
at allocation time, eliminating preloading, less re-walking the tree,
more efficient iterations and not exposing RCU-protected pointers to
its users.
This patch set
1. Introduces the XArray implementation
2. Converts the pagecache to use it
3. Converts memremap to use it
The page cache is the most complex and important user of the radix
tree, so converting it was most important. Converting the memremap
code removes the only other user of the multiorder code, which allows
us to remove the radix tree code that supported it.
I have 40+ followup patches to convert many other users of the radix
tree over to the XArray, but I'd like to get this part in first. The
other conversions haven't been in linux-next and aren't suitable for
applying yet, but you can see them in the xarray-conv branch if you're
interested"
* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
radix tree: Remove multiorder support
radix tree test: Convert multiorder tests to XArray
radix tree tests: Convert item_delete_rcu to XArray
radix tree tests: Convert item_kill_tree to XArray
radix tree tests: Move item_insert_order
radix tree test suite: Remove multiorder benchmarking
radix tree test suite: Remove __item_insert
memremap: Convert to XArray
xarray: Add range store functionality
xarray: Move multiorder_check to in-kernel tests
xarray: Move multiorder_shrink to kernel tests
xarray: Move multiorder account test in-kernel
radix tree test suite: Convert iteration test to XArray
radix tree test suite: Convert tag_tagged_items to XArray
radix tree: Remove radix_tree_clear_tags
radix tree: Remove radix_tree_maybe_preload_order
radix tree: Remove split/join code
radix tree: Remove radix_tree_update_node_t
page cache: Finish XArray conversion
dax: Convert page fault handlers to XArray
...
The helper find_lock_delalloc_range is now conditionally built static,
dpending on whether the self-tests are enabled or not. There's a macro
that is supposed to hide the export, used only once. To discourage
further use, drop it an add a public wrapper for the helper needed by
tests.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
are dirty, so no need to set pages dirty again.
Ftrace showed that the loop took 10us on my dev box, so removing this
can save us at least 10us if eb is already dirty and otherwise avoid a
potentially expensive calls to set_page_dirty.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just in case that someone breaks the rule that pages are dirty as long
as eb is dirty. The next patch will dirty the pages conditionally.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As next_state() is already defined to get the next state, use it in
find_first_extent_bit. No functional changes.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
a_ops->readpages() is only ever used for read-ahead. Ensure that we
pass this information down to the block layer.
Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The data and metadata callback implementation both use the same
function. We can remove the call indirection and intermediate helper
completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The data and metadata callback implementation both use the same
function. We can remove the call indirection completely.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All implementations of the callback are trivial and do the same and
there's only one user. Merge everything together.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_BUFFER_DUMMY is an awful name for this flag. Buffers which have
this flag set are not in any way dummy. Rather, they are private in the
sense that are not mapped and linked to the global buffer tree. This
flag has subtle implications to the way free_extent_buffer works for
example, as well as controls whether page->mapping->private_lock is held
during extent_buffer release. Pages for an unmapped buffer cannot be
under io, nor can they be written by a 3rd party so taking the lock is
unnecessary.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ EXTENT_BUFFER_UNMAPPED, update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Remove stale comment since there is no longer an eb->eb_lock and
document the locking expectation with a lockdep_assert_held statement.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function used to release one page (and always the first one), but
not anymore since a50924e3a4 ("btrfs: drop constant param
from btrfs_release_extent_buffer_page"). Update the name and comment.
Signed-off-by: David Sterba <dsterba@suse.com>
The purpose of the function is to free all the pages comprising an
extent buffer. This can be achieved with a simple for loop rather than
the slightly more involved 'do {} while' construct. So rewrite the
loop using a 'for' construct. Additionally we can never have an
extent_buffer that has 0 pages so remove the check for index == 0. No
functional changes.
The reversed order used to have a meaning in the past where the first
page served as a blocking point for several callers. See eg
4f2de97ace ("Btrfs: set page->private to the eb").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit eb14ab8ed2 ("Btrfs: fix page->private races") fixed a genuine
race between extent buffer initialisation and btree_releasepage.
Unfortunately as the code has evolved the comments weren't changed which
made them slightly wrong and they weren't very clear in the fist place.
Fix this by (hopefully) rewording them in a more approachable manner.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current version of the page unlocking code was added in
727011e07c ("Btrfs: allow metadata blocks larger than the page size")
but even in this commit that particular flag was never used per-se. In
fact, btrfs only uses PageChecked for data pages to identify pages
which have been dirtied but don't have ORDERED bit set. For more
information see 247e743cbe ("Btrfs: Use async helpers to deal with
pages that have been improperly dirtied").
However, this doesn't apply to extent buffer pages. The important bit
here is that the pages are unlocked AFTER the extent buffer has been
properly recorded in the radix tree to avoid races with
btree_releasepage. Let's exploit this fact and simplify the page
unlocking sequence by unlocking the pages in-order and removing the
redundant PageChecked flag setting/clearing.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The loops iterating eb pages use unsigned long, that's an overkill as
we know that there are at most 16 pages (64k / 4k), and 4 by default
(with nodesize 16k).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Almost all callers pass the start and len as 2 arguments but this is not
necessary, all the information is provided by the eb. By reordering the
calls to num_extent_pages, we don't need the local variables with
start/len.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAltSwkIACgkQxWXV+ddt
WDtMBQ//UXMjHaXvFmC0SM6NczuYQR51hLYtJFIKig93XK5goVpUBTNbxO7LX/Tn
4zKoKyVhkW1884V6mRiC+G23QLbo0BQZA7DExfyJ3jylQdjZMBm+K+r19OtGQf5v
CII7oUwni03KIiXIqiFAL5dLWebVQpG5EKJbh8GLZsmg6xNcyVaUqZ/fHXajbZiv
ldEBtHBKIv7WWTJmylMBKMWnRz+jqU91fXPahoU6R5qivODrLt1o/PMuSjVNhaxe
iDldHfdOaiQmLHB/1kOGyv492oW5mSSVNDE8LjEDZ61tDNlAcUyuKUWIRBxDEDtD
6D7rlVQXJ/N7sJ6+UYmJKsRpHL+NOkyzSZ0QEU/sm1Xpm8gkhHuuofRPrVCtd3l1
ZSbwvlrdyjigVEBfM3IbToQ/K6Rc1ZGId20OAs9PCQbb+mj9IxPIncZ7pI1c4hlh
pPEjcYsp14JbCTjctFalcqTiFY5tHRQsx+GUFnDyOcdL7Mi+CoH+0Jy61Vgz9GQE
7s934cfEC0ot/f66kAL/PZzxUfC7TePqaa+sDfS5BIkJ4M6lPMxS5De5R4Z0+Nzr
DXgQAlgXmxfRjpOYMTH9D0EDdSeJaNmVHgk7hFbiYk/KX3oyd4NmgI9Cfao8rQJv
2yd8wF2httfSJKD4b/Hv9r6Ho/Bw9PK59BvWOKYhSj6IGl32utw=
=f7eB
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fix of a corruption regarding fsync and clone, under some very
specific conditions explained in the patch.
The fix is marked for stable 3.16+ so I'd like to get it merged now
given the impact"
* tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix file data corruption after cloning a range and fsync
When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
later ends up calling try_release_extent_mapping() which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allow blocking and the range not being locked (which
is the case during the clone operation) nor being the extent map flagged
as pinned (also the case for cloning).
The following example, turned into a test for fstests, reproduces the
issue:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
$ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
# reflink destination offset corresponds to the size of file bar,
# 2728Kb minus 4Kb.
$ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
$ md5sum /mnt/bar
95a95813a8c2abc9aa75a6c2914a077e /mnt/bar
<power fail>
$ mount /dev/sdb /mnt
$ md5sum /mnt/bar
207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
# digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
# power failure
In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.
Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger then 16Mb or gfp flags do not allow blocking).
CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsvtAIACgkQxWXV+ddt
WDvPJQ//eEr+ACNxG5n42e63TaNpLOeoGAsUChwekP6DAIc2n/r78SHX+T/woDqd
7/dc2eqlYF5Fmn6MPQ1ufL2xlfw0t3OnemK8T5+4sxZDdzeAH9V+kHaqoaLUPExL
4r6lK5Ywwa2cWghC7WvQg3+bWLX18OEExG63SlEhLo3YJM5uUqVhGVPi6ARrbxNM
GJvXcQsxjXLqukm4gYvHC6Zra9Awv65uiAU+VCm2y96j1fEJ0yjK/pC1RtoFGCqS
48Jjuzfq/V3nxy0Wjr3DvpQVEQcKyGha6i/eazZISdRhGSjrYdvIwpUn7gZeoo2Q
hT8VVergLbVYgIeaOwgwubQzNaG2C/ZTsEjPQrNrA7a/AGsh5C/ommYE9MSyS6L0
PG0NLUNDXFmEj8WdI97So+1Sm2OCb04DPbPhHbbkhw5L/MPE1TaLN5aUWguj8laB
NnyPRdVP9jCJAI0OhJY7nPDmsPKe2jogVVsRheTcob+V5G+vIgzDXlGfsW/88Seg
dHubMaC0nz6u8Cj4dIviitiLXuustyz0JkVdljTLawWWJ/Hs7NlsSf3Q3nj2Kvia
e8QMID0vLphQyL1hqC0n7M0g2ohq9NUGT1nhLTPSpFl0l8bIA9PQehOx9Q5vx8yp
tJF+d0qiNfgvadA+KhyvQ8puQb2+zZQ8+Pqfjwd4ySD7SI5dcA4=
=fCdm
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes and an incorrect error value propagation fix from
'rename exchange'"
* tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix return value on rename exchange failure
btrfs: fix invalid-free in btrfs_extent_same
Btrfs: fix physical offset reported by fiemap for inline extents
Commit 9d311e11fc ("Btrfs: fiemap: pass correct bytenr when
fm_extent_count is zero") introduced a regression where we no longer
report 0 as the physical offset for inline extents (and other extents
with a special block_start value). This is because it always sets the
variable used to report the physical offset ("disko") as em->block_start
plus some offset, and em->block_start has the value 18446744073709551614
((u64) -2) for inline extents.
This made the btrfs test 004 (from fstests) often fail, for example, for
a file with an inline extent we have the following items in the subvolume
tree:
item 101 key (418 INODE_ITEM 0) itemoff 11029 itemsize 160
generation 25 transid 38 size 1525 nbytes 1525
block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
sequence 0 flags 0x2(none)
atime 1529342058.461891730 (2018-06-18 18:14:18)
ctime 1529342058.461891730 (2018-06-18 18:14:18)
mtime 1529342058.461891730 (2018-06-18 18:14:18)
otime 1529342055.869892885 (2018-06-18 18:14:15)
item 102 key (418 INODE_REF 264) itemoff 11016 itemsize 13
index 25 namelen 3 name: fc7
item 103 key (418 EXTENT_DATA 0) itemoff 9470 itemsize 1546
generation 38 type 0 (inline)
inline extent data size 1525 ram_bytes 1525 compression 0 (none)
Then when test 004 invoked fiemap against the file it got a non-zero
physical offset:
$ filefrag -v /mnt/p0/d4/d7/fc7
Filesystem type is: 9123683e
File size of /mnt/p0/d4/d7/fc7 is 1525 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 4095: 18446744073709551614.. 4093: 4096: last,not_aligned,inline,eof
/mnt/p0/d4/d7/fc7: 1 extent found
This resulted in the test failing like this:
btrfs/004 49s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad)
--- tests/btrfs/004.out 2016-08-23 10:17:35.027012095 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad 2018-06-18 18:15:02.385872155 +0100
@@ -1,3 +1,10 @@
QA output created by 004
*** test backref walking
-*** done
+./tests/btrfs/004: line 227: [: 7.55578637259143e+22: integer expression expected
+ERROR: 7.55578637259143e+22 is not a valid numeric value.
+unexpected output from
+ /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -s 65536 -P 7.55578637259143e+22 /home/fdmanana/btrfs-tests/scratch_1
...
(Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad' to see the entire diff)
Ran: btrfs/004
The large number in scientific notation reported as an invalid numeric
value is the result from the filter passed to perl which multiplies the
physical offset by the block size reported by fiemap.
So fix this by ensuring the physical offset is always set to 0 when we
are processing an extent with a special block_start value.
Fixes: 9d311e11fc ("Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsg598ACgkQxWXV+ddt
WDtG1w//R9/nvAk5A5cbForTXNyxwXAQnz0t2/4Lh5igbfJloqoTZtr47Gvqsvy+
DITU+3BPcyupBuUFLoeivPC+ruKOUmle27Vm62mdZRtyt96kiUiV/m1gbPhFy8lW
DKHK8tvtIoZObo5oNGvRxiuJPjefiChYZMDHokB+MY5ZALSRaIW9opj2WsM+ZAt7
g9CEjeitQcBY68CCpSEVBlQSz+BTZYEDFJGNCmGDxGhaBGZr/ganrkDJ75cG6U10
LnOZ6LDHNxGMqUm4wnhfmpHtVcIJiBF+gOyTumBPtFSoLnBverl684xizglpoq6d
fnUP8Y6XR9JA4OCZvo310yvX9nyqgb0H2h+APO0f7jRRcJo0QSKZ/qZR+XZCk3PU
91HtBopcGs8gGQUkRdAE7TMCiIEzL1eNOXHvsiILCObq1i7iNCe7Dzx6M6Gfgep8
a3IcoVmSw1DFpln2ZxTQw9viAib41iU46XHXz7W7rGPulF5QdXGo5ScORRG6HBLE
nZsXdTkrCPMJyN2bIhU6YJOK9rb9TjD4lUtnvzT8t1CfUsxQT4AsJykaYr9BwF2D
Z4rBruUAQ3OmvXJDfGG4T5YCAdPBN+xBcxCeyrDZSD0r6YPQGoF0dlvHmvP0J78D
oGkD1bb/gjcvsPJxtTQ4QWEh0oqZiDfRt4qdgO46vhba0onzQFw=
=X2zA
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- error handling fixup for one of the new ioctls from 1st pull
- fix for device-replace that incorrectly uses inode pages and can mess
up compressed extents in some cases
- fiemap fix for reporting incorrect number of extents
- vm_fault_t type conversion
* tag 'for-4.18-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: scrub: Don't use inode pages for device replace
btrfs: change return type of btrfs_page_mkwrite to vm_fault_t
Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero
btrfs: Check error of btrfs_iget in btrfs_search_path_in_tree_user
[BUG]
fm_mapped_extents is not correct when fm_extent_count is 0
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/btrfs/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
When user space wants to get the number of file extents,
set fm_extent_count to 0 to run fiemap and then read fm_mapped_extents.
In the above example, fiemap will return with fm_mapped_extents set to 4,
but it should be 1 since there's only one entry in the output.
[REASON]
The problem seems to be that disko is only set if
fieinfo->fi_extents_max is set. And this member is initialized, in the
generic ioctl_fiemap function, to the value of used-passed
fm_extent_count. So when the user passes 0 then fi_extent_max is also
set to zero and this causes btrfs to not initialize disko at all.
Eventually this leads emit_fiemap_extent being called with a bogus
'phys' argument preventing proper fiemap entries merging.
[FIX]
Move the disko initialization earlier in extent_fiemap making it
independent of user-passed arguments, allowing emit_fiemap_extent to
properly handle consecutive extent entries.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsVV6wACgkQxWXV+ddt
WDsaOhAAnrG3o/Bkn4loifYjiWQoArswwVJ2m5MvG14oPcn/c6wCUlE6mHdEsT8l
WgltSRSd/3hJ1BcuWjhcR8RSLBRSaPcw4FAHikAdOWMrLXSKfA+quT/upXD00iDS
+bnHoknc5vvk+Ow0KyGODAeMYfJm8gGB43BeVDr/mgrVRrwYY9r14s+6LX897NOT
qObdxBjtedua0uIYZX5673siFaSbvJNQyGn6iKO9Ww+7bkWjgKO8qY8/3/gEaJn1
q1303Oo65VOXQQKZed+NjPNakpqRmABaN79qTwWMnUI4qYBKi651a7sr5HljDpmc
szY3uF/E7AOfJSDeTBdglFR1eQgv20ceeIx7FN0n6iUbuIVH9P85mBF6f76Hivpq
fA39uPYxWwWq8RxmC09yqoTkuNS0zAQhov4n8XO0q/81Gfek0RpR+LMK15W02Ur8
+5Pazkgz+xRYRPhpExe75pbM5N31KFJdf+4wpXeZoXKGTeiiCl/SZV3NzyUtemGg
eN7ltHD/Y9iNpx7sgHn+2veXwn6psDad8ORxIlRyKDeot2pNZ4lnekPbqMC/IwAz
hF3ZpcEo8v2Q2kqwCqw5evq4NSOvfG0cEuIqAKir1AgTn2nPH99qRSR1He9kluap
GBPTPXNgBrhhdsh5kltI6CJcGE46bHsLG4LI5WvRs+wOtyaG5Ww=
=jIvf
-----END PGP SIGNATURE-----
Merge tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"User visible features:
- added support for the ioctl FS_IOC_FSGETXATTR, per-inode flags,
successor of GET/SETFLAGS; now supports only existing flags:
append, immutable, noatime, nodump, sync
- 3 new unprivileged ioctls to allow users to enumerate subvolumes
- dedupe syscall implementation does not restrict the range to 16MiB,
though it still splits the whole range to 16MiB chunks
- on user demand, rmdir() is able to delete an empty subvolume,
export the capability in sysfs
- fix inode number types in tracepoints, other cleanups
- send: improved speed when dealing with a large removed directory,
measurements show decrease from 2000 minutes to 2 minutes on a
directory with 2 million entries
- pre-commit check of superblock to detect a mysterious in-memory
corruption
- log message updates
Other changes:
- orphan inode cleanup improved, does no keep long-standing
reservations that could lead up to early ENOSPC in some cases
- slight improvement of handling snapshotted NOCOW files by avoiding
some unnecessary tree searches
- avoid OOM when dealing with many unmergeable small extents at flush
time
- speedup conversion of free space tree representations from/to
bitmap/tree
- code refactoring, deletion, cleanups:
+ delayed refs
+ delayed iput
+ redundant argument removals
+ memory barrier cleanups
+ remove a redundant mutex supposedly excluding several ioctls to
run in parallel
- new tracepoints for blockgroup manipulation
- more sanity checks of compressed headers"
* tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (183 commits)
btrfs: Add unprivileged version of ino_lookup ioctl
btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF
btrfs: Add unprivileged ioctl which returns subvolume information
Btrfs: clean up error handling in btrfs_truncate()
btrfs: Factor out write portion of btrfs_get_blocks_direct
btrfs: Factor out read portion of btrfs_get_blocks_direct
btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist
btrfs: raid56: Remove VLA usage
btrfs: return error value if create_io_em failed in cow_file_range
btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot
btrfs: drop unused parameter qgroup_reserved
btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
btrfs: lift some btrfs_cross_ref_exist checks in nocow path
btrfs: Remove fs_info argument from btrfs_uuid_tree_rem
btrfs: Remove fs_info argument from btrfs_uuid_tree_add
Btrfs: remove unused check of skip_locking
Btrfs: remove always true check in unlock_up
Btrfs: grab write lock directly if write_lock_level is the max level
Btrfs: move get root out of btrfs_search_slot to a helper
Btrfs: use more straightforward extent_buffer_uptodate check
...
Convert btrfs to embedded bio sets.
Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It can be directly referenced from the passed address_space so do that.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is called only from btrfs_readpage and is already passed
the mapping. Simplify its signature by moving the code obtaining
reference to the extent tree in the function. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used in the function so just remove it. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function already gets the page from which the two extent trees
are referenced. Simplify its signature by moving the code getting the
trees inside the function. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
le_bitmap_set() is only used by free-space-tree, so move it there and
make it static. le_bitmap_clear() is not used, so remove it.
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrTQ4sACgkQxWXV+ddt
WDti3Q/+MAeqsLTjvre2RQ3ka5hNyCuVftUIBmcP3YfJbt+xZYQyaewW4Xkfi3cm
cbJE+zehzf5ag+RJhxk3OwFTvNfLGIO9asWs3b08NGUi6VzwL0/8B/iOdZPuHSAV
TrecQIBE2Tp+xax9cQEnxav34D4dUtXNaDweGjp1MIIUkDneQP/I0vlTu7vafBgX
UVxP6riL/MCs7sjTHGIPs0lv8L/fgdmo+dk5SnNuIPTOcFTQXgVrtHjw9IvbKWd4
aq+sbNWoSrhXUfllbFg/wZqDe9tWn9E2f6m/H0ThSoNdxusSVgacOjFRYh20NKLW
WGB8Amd/ItGtJwJ1CIypa7VX2U11UAi0XT7BeiK82rUNEJ6moRqFOXG861gRLoTZ
SpH8uWO+e+CogfXob1KCndn5lot4AM2ZTkCqfrjpM35Nul72PZdne0CxNlmiRupY
Fdt5GB+sg8plcMaRiYr++BbbHP5tggX1MrhLGEbx2XBs2eRdn+2Lv2I1Ig/U4NUb
Vf+xk/tFLKGOTSZlbv7SV7ekXxG/3+7gAuL7A1XMETZCwBF4L3hwyW7CgkEBb3PC
TqX8BwMaRpyp/FgW/QL6edjXZ3a64VaHIfqRPNks3lWWcCHVbzyiVPfEjx+AEJ/0
abx6jXTJhLUBPuPxEgb5rsSv62RoxPoYHqCErrG95XwnZ8rSCts=
=I0y0
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"We have queued a few more fixes (error handling, log replay,
softlockup) and the rest is SPDX updates that touche almost all files
so the diffstat is long"
* tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Only check first key for committed tree blocks
btrfs: add SPDX header to Kconfig
btrfs: replace GPL boilerplate by SPDX -- sources
btrfs: replace GPL boilerplate by SPDX -- headers
Btrfs: fix loss of prealloc extents past i_size after fsync log replay
Btrfs: clean up resources during umount after trans is aborted
btrfs: Fix possible softlock on single core machines
Btrfs: bail out on error during replay_dir_deletes
Btrfs: fix NULL pointer dereference in log_dir_items
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.
[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The missing error handling in add_extent_changeset was hidden, so make
it at least visible in the callers.
Signed-off-by: David Sterba <dsterba@suse.com>
The merge call was factored out to a separate helper but it's a trivial
one and arguably we can opencode it and cache the value.
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass a valid pointer so we can drop the redundant checks.
The call to submit_one_bio never happend and can be removed.
Signed-off-by: David Sterba <dsterba@suse.com>
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.
Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:
- printf wrappers, error messages
- exit helpers
Signed-off-by: David Sterba <dsterba@suse.com>
extent_buffer_uptodate() is a trivial wrapper around test_bit() and
nothing else. So make it static and inline, save on code space and call
indirection.
Before:
text data bss dec hex filename
1131257 82898 18992 1233147 12d0fb fs/btrfs/btrfs.ko
After:
text data bss dec hex filename
1131090 82898 18992 1232980 12d054 fs/btrfs/btrfs.ko
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This prints out eb->bflags since it contains some useful information,
e.g. whether eb is dirty.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpvikQACgkQxWXV+ddt
WDs6qA//ZE7eEH0sKpD4Z+3gUevk/MMXwE9prRijEdjXz/K/UXtvpq0sI7HMQskZ
Ls9Wmzof+3WEQoa08RQZFzwuclW1Udm09SqE2oHP2gXQB5rC0BtWdrlMaKUJy03y
NUwxHetbE6TsFLU5HIVmi05NexNx9SVV6oJTWt00RlXTePw9Aoc88ikoXXUE2vqH
wbH9/ccmM9EkDFxdG+YG5QX054kQV8/5RXdqBJnIiGVRX5ZsAY84AN9x9YoRCVUw
wq9TfPu6XmeA6Uq6knpeLlXDms5w+FE3n5CduROk7Q7YNgpoZekF20c8uK8HzT4T
KF8hc0QpQgRCVBJ8I4MbPSMRIDf3IWfZmWSDEDda/6/ep6Bl99b8PFvdDKDBMUct
8wsgGrwGbHuz2l2QUIXjpBL9Cv9Tbu8vjmg0h2hFrpiH1c8JaXjKtJXAMtigWsZ1
DdX+5Y0zqvV/YLpzKF4aMDWXIteN4qaznvjdmj3B7BxgcnITOV/cmPCyMplNrtUa
Cs2fzGV5tpxhBzxE490v+frMULmLq2W1e6WPfFCqPKBCqulcR75TozDQS9M2/h4k
uAZzVKoguHrUPP1ONVas9aVC05K473nbYPd28eecYwkZ4z32hK4SAdnQY1aPreTe
axoV7p7a+i1bkzT5LK6gIfpddVWth8w45nz4P0lwxp0Z6XlkbJE=
=Irul
-----END PGP SIGNATURE-----
Merge tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features or user visible changes:
- fallocate: implement zero range mode
- avoid losing data raid profile when deleting a device
- tree item checker: more checks for directory items and xattrs
Notable fixes:
- raid56 recovery: don't use cached stripes, that could be
potentially changed and a later RMW or recovery would lead to
corruptions or failures
- let raid56 try harder to rebuild damaged data, reading from all
stripes if necessary
- fix scrub to repair raid56 in a similar way as in the case above
Other:
- cleanups: device freeing, removed some call indirections, redundant
bio_put/_get, unused parameters, refactorings and renames
- RCU list traversal fixups
- simplify mount callchain, remove recursing back when mounting a
subvolume
- plug for fsync, may improve bio merging on multiple devices
- compression heurisic: replace heap sort with radix sort, gains some
performance
- add extent map selftests, buffered write vs dio"
* tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits)
btrfs: drop devid as device_list_add() arg
btrfs: get device pointer from device_list_add()
btrfs: set the total_devices in device_list_add()
btrfs: move pr_info into device_list_add
btrfs: make btrfs_free_stale_devices() to match the path
btrfs: rename btrfs_free_stale_devices() arg to skip_dev
btrfs: make btrfs_free_stale_devices() argument optional
btrfs: make btrfs_free_stale_device() to iterate all stales
btrfs: no need to check for btrfs_fs_devices::seeding
btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
Btrfs: noinline merge_extent_mapping
Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
Btrfs: extent map selftest: dio write vs dio read
Btrfs: extent map selftest: buffered write vs dio read
Btrfs: add extent map selftests
Btrfs: move extent map specific code to extent_map.c
Btrfs: add helper for em merge logic
Btrfs: fix unexpected EEXIST from btrfs_get_extent
Btrfs: fix incorrect block_len in merge_extent_mapping
btrfs: Remove unused readahead spinlock
...
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one instance where we pass different gfp mask to
unlock_extent_cached. Add a separate helper for that and then we can
drop the gfp parameter from unlock_extent_cached.
Signed-off-by: David Sterba <dsterba@suse.com>
The bio is never referenced after it has been submitted so there is no
point in getting an extra reference.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions called from extent_write_cache_pages used void* as generic
callback data, but all of them convert it to extent_page_data, or use it
directly.
Signed-off-by: David Sterba <dsterba@suse.com>
The function extent_write_cache_pages is modelled after
write_cache_pages which is a generic interface and the writepage
parameter makes sense there. In btrfs we know exactly which callback
we're going to use, so we can pass it directly.
Signed-off-by: David Sterba <dsterba@suse.com>
flush_epd_write_bio is same as flush_write_bio, no point having two such
functions. Merge them to flush_write_bio. The 'noinline' attribute is
removed as it does not have any meaning.
Signed-off-by: David Sterba <dsterba@suse.com>
The tree argument passed to extent_write_full_page is referenced from
the page being passed to the same function. Since we already have
enough information to get the reference, remove the function parameter.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is called only from submit_compressed_extents and the
io tree being passed is always that of the inode. But we are also
passing the inode, so just move getting the io tree pointer in
extent_write_locked_range to simplify the signature.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently device state is being managed by each individual int
variable such as struct btrfs_device::writeable. Instead of that
declare device state BTRFS_DEV_STATE_WRITEABLE and use the
bit operations.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 9036c10208 ("Btrfs: update hole handling v2") added the
FLAG_VACANCY to denote holes, however there was already a consistent way
of flagging extents which represent hole - ->block_start =
EXTENT_MAP_HOLE. And also the only place where this flag is checked is
in the fiemap code, but the block_start value is also checked and every
other place in the filesystem detects holes by using block_start
value's. So remove the extra flag. This survived a full xfstest run.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass btrfs_get_extent_fiemap and get_extent_skip_holes
itself is used only as a fiemap helper.
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass btrfs_get_extent_fiemap and we don't expect anything
else in the context of extent_fiemap.
Signed-off-by: David Sterba <dsterba@suse.com>
Previous patches cleaned up all places where
extent_page_data::get_extent was set and it was btrfs_get_extent all the
time, so we can simply call that instead.
This also reduces size of extent_page_data by 8 bytes which has positive
effect on stack consumption on various functions on the write out path.
Signed-off-by: David Sterba <dsterba@suse.com>
All callers use GFP_NOFS, we don't have to pass it as an argument. The
built-in tests pass GFP_KERNEL, but they run only at module load time
and NOFS works there as well.
Signed-off-by: David Sterba <dsterba@suse.com>
Use __clear_extent_bit directly in case we want to pass unknown
gfp flags. Otherwise all clear_extent_bit callers use GFP_NOFS, so we
can sink them to the function and reduce argument count, at the cost
that __clear_extent_bit has to be exported.
Signed-off-by: David Sterba <dsterba@suse.com>
The following callpath is always invoked with mirror_num set to 0, so
let's remove it as an argument and directly pass 0 to __do_redpage. No
functional change.
extent_readpages
__extent_readpages
__do_contiguous_readpages
__do_readpage
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS uses bio->bi_vcnt to figure out page numbers, this approach is no
longer valid once we start enabling multipage bvecs.
correct once we start to enable multipage bvec.
Use bio_nr_pages() to do that instead.
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlofBpkACgkQxWXV+ddt
WDvtTQ//emI1QsD4N0e4BxMcZ1bcigiEk3jc4gj+biRapnMHHAHOqJbVtpK1v8gS
PCTw+4uD5UOGvhBtS4kXJn8e2qxWCESWJDXwVlW0RHmWLfwd9z7ly0sBMi3oiIqH
qief8CIkk3oTexiuuJ3mZGxqnDQjRGtWx2LM+bRJBWMk+jN32v2ObSlv9V505a5M
1daDBsjWojFWa8d4r3YZNJq1df2om/dwVQZ0Wk59bacIo9Xbvok0X459cOlWuv0p
mjx8m8uA/z+HVdkTYlzyKpq08O8Z4shj3GrBbSnZ511gKzV+c9jJPxij5pKm3Z2z
KW4Mp17+/7GSNcSsJiqnOYi+wtOrak2lD0COlZTijnY2jrv18h8ianoIM6CpzUdy
+b09yuFXbPLoUfyl6vFaO/JHuvAkQdaR2tJbds6lvW+liC1ReoL4W1WcUjY6nv9f
6wTaIv0vwgrHaxeIzxKNpnsTlpHAgorFFk0/w8nLb40WX8AoJ/95lo2zws8oaFDN
0Fylu3NYhoDrJZK+D8dbsWx2eTsFVCqep4w0+iEVZl3lfuy3FZl1pu8CL7ru9vJl
DNieh+lUvK1Fk+SYIoilGoriW96RbU8+jPo2W4A1ENzeMJfrNCSWtUSZZp4XT4tO
8m1PGud07XBLSxd62bAEDV3KZO2DnY1WxgXbKuIHSi9D5CI1LMo=
=7UW+
-----END PGP SIGNATURE-----
Merge tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We've collected some fixes in since the pre-merge window freeze.
There's technically only one regression fix for 4.15, but the rest
seems important and candidates for stable.
- fix missing flush bio puts in error cases (is serious, but rarely
happens)
- fix reporting stat::st_blocks for buffered append writes
- fix space cache invalidation
- fix out of bound memory access when setting zlib level
- fix potential memory corruption when fsync fails in the middle
- fix crash in integrity checker
- incremetnal send fix, path mixup for certain unlink/rename
combination
- pass flags to writeback so compressed writes can be throttled
properly
- error handling fixes"
* tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: incremental send, fix wrong unlink path after renaming file
btrfs: tree-checker: Fix false panic for sanity test
Btrfs: fix list_add corruption and soft lockups in fsync
btrfs: Fix wild memory access in compression level parser
btrfs: fix deadlock when writing out space cache
btrfs: clear space cache inode generation always
Btrfs: fix reported number of inode blocks after buffered append writes
Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
Btrfs: bail out gracefully rather than BUG_ON
btrfs: dev_alloc_list is not protected by RCU, use normal list_del
btrfs: add missing device::flush_bio puts
btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
Btrfs: add write_flags for compression bio
This is a pure automated search-and-replace of the internal kernel
superblock flags.
The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.
Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.
The script to do this was:
# places to look in; re security/*: it generally should *not* be
# touched (that stuff parses mount(2) arguments directly), but
# there are two places where we really deal with superblock flags.
FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
include/linux/fs.h include/uapi/linux/bfs_fs.h \
security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
# the list of MS_... constants
SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
ACTIVE NOUSER"
SED_PROG=
for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
# we want files that contain at least one of MS_...,
# with fs/namespace.c and fs/pnode.c excluded.
L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
for f in $L; do sed -i $f $SED_PROG; done
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot. As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.
No performance impact is expected as the overhead is marginal. The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.
Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages. Just drop the argument.
Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.
Link: http://lkml.kernel.org/r/20171009151359.31984-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compression code path has only flaged bios with REQ_OP_WRITE no matter
where the bios come from, but it could be a sync write if fsync starts
this writeback or a normal writeback write if wb kthread starts a
periodic writeback.
It breaks the rule that sync writes and writeback writes need to be
differentiated from each other, because from the POV of block layer,
all bios need to be recognized by these flags in order to do some
management, e.g. throttlling.
This passes writeback_control to compression write path so that it can
send bios with proper flags to block layer.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from David Sterba:
"There are some new user features and the usual load of invisible
enhancements or cleanups.
New features:
- extend mount options to specify zlib compression level, -o
compress=zlib:9
- v2 of ioctl "extent to inode mapping", addressing a usecase where
we want to retrieve more but inaccurate results and do the
postprocessing in userspace, aiding defragmentation or
deduplication tools
- populate compression heuristics logic, do data sampling and try to
guess compressibility by: looking for repeated patterns, counting
unique byte values and distribution, calculating Shannon entropy;
this will need more benchmarking and possibly fine tuning, but the
base should be good enough
- enable indexing for btrfs as lower filesystem in overlayfs
- speedup page cache readahead during send on large files
Internal enhancements:
- more sanity checks of b-tree items when reading them from disk
- more EINVAL/EUCLEAN fixups, missing BLK_STS_* conversion, other
errno or error handling fixes
- remove some homegrown IO-related logic, that's been obsoleted by
core block layer changes (batching, plug/unplug, own counters)
- add ref-verify, optional debugging feature to verify extent
reference accounting
- simplify code handling outstanding extents, make it more clear
where and how the accounting is done
- make delalloc reservations per-inode, simplify the code and make
the logic more straightforward
- extensive cleanup of delayed refs code
Notable fixes:
- fix send ioctl on 32bit with 64bit kernel"
* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (102 commits)
btrfs: Fix bug for misused dev_t when lookup in dev state hash table.
Btrfs: heuristic: add Shannon entropy calculation
Btrfs: heuristic: add byte core set calculation
Btrfs: heuristic: add byte set calculation
Btrfs: heuristic: add detection of repeated data patterns
Btrfs: heuristic: implement sampling logic
Btrfs: heuristic: add bucket and sample counters and other defines
Btrfs: compression: separate heuristic/compression workspaces
btrfs: move btrfs_truncate_block out of trans handle
btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
btrfs: track refs in a rb_tree instead of a list
btrfs: add a comp_refs() helper
btrfs: switch args for comp_*_refs
btrfs: make the delalloc block rsv per inode
btrfs: add tracepoints for outstanding extents mods
Btrfs: rework outstanding_extents
btrfs: increase output size for LOGICAL_INO_V2 ioctl
btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
btrfs: send: remove unused code
...
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The use of sector_t in the callchain of submit_extent_page is not
necessary. Switch to u64 and rename the variable and use byte units
instead of 512b, ie. dropping the >> 9 shifts and avoiding the
con(tro)versions of sector_t.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to remove sector_t and will use 'offset', so this patch
frees the name.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.
We can remove this bio_flags because in order to write dirty blocks,
log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers, therefore, every write goes in a
synchronous way, so does checksuming.
Please also note that, bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, ie.
1) while log tree is flushing its dirty blocks via
btrfs_write_marked_extents(), in which %sync_writers is increased
by one.
2) in the meantime, some writeback operations may happen upon btrfs's
metadata inode, so these writes go synchronously, too.
However, AFAICS, the overhead is not a big one while the win is that
we unify the two places that needs synchronous way and remove a
special hack/flag.
This removes the bio_flags related stuff for writing log-tree.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from David Sterba:
"Two more fixes for bugs introduced in 4.13.
The sector_t problem with 32bit architecture and !LBDAF config seems
serious but the number of affected deployments is hopefully low.
The clashing status bits could lead to a confusing in-memory state of
the whole-filesystem operations if used with the quota override sysfs
knob"
* 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix overlap of fs_info::flags values
btrfs: avoid overflow when sector_t is 32 bit
Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.
In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).
I added a cast to u64 to avoid overflow.
Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from David Sterba:
"We've collected a bunch of isolated fixes, for crashes, user-visible
behaviour or missing bits from other subsystem cleanups from the past.
The overall number is not small but I was not able to make it
significantly smaller. Most of the patches are supposed to go to
stable"
* 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: log csums for all modified extents
Btrfs: fix unexpected result when dio reading corrupted blocks
btrfs: Report error on removing qgroup if del_qgroup_item fails
Btrfs: skip checksum when reading compressed data if some IO have failed
Btrfs: fix kernel oops while reading compressed data
Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
Btrfs: do not backup tree roots when fsync
btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
btrfs: propagate error to btrfs_cmp_data_prepare caller
btrfs: prevent to set invalid default subvolid
Btrfs: send: fix error number for unknown inode types
btrfs: fix NULL pointer dereference from free_reloc_roots()
btrfs: finish ordered extent cleaning if no progress is found
btrfs: clear ordered flag on cleaning up ordered extents
Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
Btrfs: do not reset bio->bi_ops while writing bio
Btrfs: use the new helper wbc_to_write_flags
flush_epd_write_bio() sets bio->bi_opf by itself to honor REQ_SYNC,
but it's not needed at all since bio->bi_opf has set up properly in
both __extent_writepage() and write_one_eb(), and in the case of
write_one_eb(), it also sets REQ_META, which we will lose in
flush_epd_write_bio().
This remove this unnecessary bio->bi_opf setting.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This updates btrfs to use the helper wbc_to_write_flags which has been
applied in ext4/xfs/f2fs/block.
Please note that, with this, btrfs's dirty pages written by a
writeback job will carry the flag REQ_BACKGROUND, which is currently
used by writeback-throttle to determine whether it should go to get a
request or wait.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull mount flag updates from Al Viro:
"Another chunk of fmount preparations from dhowells; only trivial
conflicts for that part. It separates MS_... bits (very grotty
mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
only a small subset of MS_... stuff).
This does *not* convert the filesystems to new constants; only the
infrastructure is done here. The next step in that series is where the
conflicts would be; that's the conversion of filesystems. It's purely
mechanical and it's better done after the merge, so if you could run
something like
list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')
sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
-e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
-e 's/\<MS_NODEV\>/SB_NODEV/g' \
-e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
-e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
-e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
-e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
-e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
-e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
-e 's/\<MS_SILENT\>/SB_SILENT/g' \
-e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
-e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
-e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
-e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
$list
and commit it with something along the lines of 'convert filesystems
away from use of MS_... constants' as commit message, it would save a
quite a bit of headache next cycle"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Differentiate mount flags (MS_*) from internal superblock flags
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
Pull btrfs updates from David Sterba:
"The changes range through all types: cleanups, core chagnes, sanity
checks, fixes, other user visible changes, detailed list below:
- deprecated: user transaction ioctl
- mount option ssd does not change allocation alignments
- degraded read-write mount is allowed if all the raid profile
constraints are met, now based on more accurate check
- defrag: do not reset compression afterwards; the NOCOMPRESS flag
can be now overriden by defrag
- prep work for better extent reference tracking (related to the
qgroup slowness with balance)
- prep work for compression heuristics
- memory allocation reductions (may help latencies on a loaded
system)
- better accounting for io waiting states
- error handling improvements (removed BUGs)
- added more sanity checks for shared refs
- fix readdir vs pagefault deadlock under some circumstances
- fix for 'no-hole' mode, certain combination of compressed and
inline extents
- send: fix emission of invalid clone operations
- fixup file mode if setting acls fail
- more fixes from fuzzing
- oher cleanups"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
btrfs: submit superblock io with REQ_META and REQ_PRIO
btrfs: remove unnecessary memory barrier in btrfs_direct_IO
btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
btrfs: pass fs_info to btrfs_del_root instead of tree_root
Btrfs: add one more sanity check for shared ref type
Btrfs: remove BUG_ON in __add_tree_block
Btrfs: remove BUG() in add_data_reference
Btrfs: remove BUG() in print_extent_item
Btrfs: remove BUG() in btrfs_extent_inline_ref_size
Btrfs: convert to use btrfs_get_extent_inline_ref_type
Btrfs: add a helper to retrive extent inline ref type
btrfs: scrub: simplify scrub worker initialization
btrfs: scrub: clean up division in scrub_find_csum
btrfs: scrub: clean up division in __scrub_mark_bitmap
btrfs: scrub: use bool for flush_all_writes
btrfs: preserve i_mode if __btrfs_set_acl() fails
btrfs: Remove extraneous chunk_objectid variable
btrfs: Remove chunk_objectid argument from btrfs_make_block_group
btrfs: Remove extra parentheses from condition in copy_items()
...
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a corner case that slips through the checkers in functions
reading extent buffer, ie.
if (start < eb->len) and (start + len > eb->len),
then
a) map_private_extent_buffer() returns immediately because
it's thinking the range spans across two pages,
b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
and WARN_ON(start + len > eb->start + eb->len), both are OK in this
corner case, but it'd actually try to access the eb->pages out of
bounds because of (start + len > eb->len).
The case is found by switching extent inline ref type from shared data
ref to non-shared data ref, which is a kind of metadata corruption.
It'd use the wrong helper to access the eb,
eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing
here is "struct btrfs_shared_data_ref". And if the extent item
happens to be the first item in the eb, then offset/length will get
over eb->len which ends up an invalid memory access.
This is adding proper checks in order to avoid invalid memory access,
ie. 'general protection fault', before it's too late.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function submit_extent_page has 15(!) parameters right now, op and
op_flags are effectively one value stored to bio::bi_opf, no need to
pass them separately. So it's 14 parameters now.
Signed-off-by: David Sterba <dsterba@suse.com>
This also adjusts the respective callers in other files. Those were
found with -Wunused-parameter.
btrfs_full_stripe_len's mapping_tree - introduced by 53b381b3ab
("Btrfs: RAID5 and RAID6") but it was never really used even in that
commit
btrfs_is_parity_mirror's mirror_num - same as above
chunk_drange_filter's chunk_offset - introduced by 94e60d5a5c ("Btrfs:
devid subset filter") and never used.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit afce772e87 ("btrfs: fix check_shared for fiemap ioctl") added
transaction semantics around calls to btrfs_check_shared() in order to
provide accurate accounting of delayed refs. The transaction management
should be done inside btrfs_check_shared(), so that callers do not need
to manage transactions individually.
Signed-off-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have reader helpers for most of the on-disk structures that use
an extent_buffer and pointer as offset into the buffer that are
read-only. We should mark them as const and, in turn, allow consumers
of these interfaces to mark the buffers const as well.
No impact on code, but serves as documentation that a buffer is intended
not to be modified.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Firstly by applying the following with coccinelle's spatch:
@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)
to effect the conversion to sb_rdonly(sb), then by applying:
@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)
@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)
to remove left over excess bracketage and finally by applying:
@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)
to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull btrfs fixes from David Sterba:
"We've identified and fixed a silent corruption (introduced by code in
the first pull), a fixup after the blk_status_t merge and two fixes to
incremental send that Filipe has been hunting for some time"
* 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix unexpected return value of bio_readpage_error
btrfs: btrfs_create_repair_bio never fails, skip error handling
btrfs: cloned bios must not be iterated by bio_for_each_segment_all
Btrfs: fix write corruption due to bio cloning on raid5/6
Btrfs: incremental send, fix invalid memory access
Btrfs: incremental send, fix invalid path for link commands
With blk_status_t conversion (that are now present in master),
bio_readpage_error() may return 1 as now ->submit_bio_hook() may not set
%ret if it runs without problems.
This fixes that unexpected return value by changing
btrfs_check_repairable() to return a bool instead of updating %ret, and
patch is applicable to both codebases with and without blk_status_t.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As the function uses the non-failing bio allocation, we can remove error
handling from the callers as well.
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've started using cloned bios more in 4.13, there are some specifics
regarding the iteration. Filipe found [1] that the raid56 iterated a
cloned bio using bio_for_each_segment_all, which is incorrect. The
cloned bios have wrong bi_vcnt and this could lead to silent
corruptions. This patch adds assertions to all remaining
bio_for_each_segment_all cases.
[1] https://patchwork.kernel.org/patch/9838535/
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull percpu updates from Tejun Heo:
"These are the percpu changes for the v4.13-rc1 merge window. There are
a couple visibility related changes - tracepoints and allocator stats
through debugfs, along with __ro_after_init markings and a cosmetic
rename in percpu_counter.
Please note that the simple O(#elements_in_the_chunk) area allocator
used by percpu allocator is again showing scalability issues,
primarily with bpf allocating and freeing large number of counters.
Dennis is working on the replacement allocator and the percpu
allocator will be seeing increased churns in the coming cycles"
* 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: fix static checker warnings in pcpu_destroy_chunk
percpu: fix early calls for spinlock in pcpu_stats
percpu: resolve err may not be initialized in pcpu_alloc
percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
percpu: add tracepoint support for percpu memory
percpu: expose statistics about percpu memory via debugfs
percpu: migrate percpu data structures to internal header
percpu: add missing lockdep_assert_held to func pcpu_free_area
mark most percpu globals as __ro_after_init
Pull btrfs updates from David Sterba:
"The core updates improve error handling (mostly related to bios), with
the usual incremental work on the GFP_NOFS (mis)use removal,
refactoring or cleanups. Except the two top patches, all have been in
for-next for an extensive amount of time.
User visible changes:
- statx support
- quota override tunable
- improved compression thresholds
- obsoleted mount option alloc_start
Core updates:
- bio-related updates:
- faster bio cloning
- no allocation failures
- preallocated flush bios
- more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates
- prep work for btree_inode removal
- dir-item validation
- qgoup fixes and updates
- cleanups:
- removed unused struct members, unused code, refactoring
- argument refactoring (fs_info/root, caller -> callee sink)
- SEARCH_TREE ioctl docs"
* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
btrfs: Remove false alert when fiemap range is smaller than on-disk extent
btrfs: Don't clear SGID when inheriting ACLs
btrfs: fix integer overflow in calc_reclaim_items_nr
btrfs: scrub: fix target device intialization while setting up scrub context
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
btrfs: qgroup: Add quick exit for non-fs extents
Btrfs: rework delayed ref total_bytes_pinned accounting
Btrfs: return old and new total ref mods when adding delayed refs
Btrfs: always account pinned bytes when dropping a tree block ref
Btrfs: update total_bytes_pinned when pinning down extents
Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
btrfs: fix validation of XATTR_ITEM dir items
btrfs: Verify dir_item in iterate_object_props
btrfs: Check name_len before in btrfs_del_root_ref
btrfs: Check name_len before reading btrfs_get_name
...
Commit 4751832da9 ("btrfs: fiemap: Cache and merge fiemap extent before
submit it to user") introduced a warning to catch unemitted cached
fiemap extent.
However such warning doesn't take the following case into consideration:
0 4K 8K
|<---- fiemap range --->|
|<----------- On-disk extent ------------------>|
In this case, the whole 0~8K is cached, and since it's larger than
fiemap range, it break the fiemap extent emit loop.
This leaves the fiemap extent cached but not emitted, and caught by the
final fiemap extent sanity check, causing kernel warning.
This patch removes the kernel warning and renames the sanity check to
emit_last_fiemap_cache() since it's possible and valid to have cached
fiemap extent.
Reported-by: David Sterba <dsterba@suse.cz>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Fixes: 4751832da9 ("btrfs: fiemap: Cache and merge fiemap extent ...")
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt safe due to explicit calls to preempt_disable. Given
how __ prefix is used in percpu related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add. In terms of context-safety,
they're equivalent. The only difference is that the __ version takes
a batch parameter.
Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.
This patch doesn't cause any functional changes.
tj: Minor updates to patch description for clarity. Cosmetic
indentation updates.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more imporatant,
contexts and the bio allocating API now looks more consistent.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use btrfs_bioset for bios and ask to allocate the entire size of
btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is
initialized but the bytes from 0 to offset of 'bio' are left
uninitialized. Although we initialize some of the members in our
helpers, we should initialize the whole structures.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that
in the helper and simplify the logic in the callsers.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES.
submit_extent_page adds __GFP_HIGH that does not make a difference in
our case as it allows access to memory reserves but otherwise does not
change the constraints.
Signed-off-by: David Sterba <dsterba@suse.com>
Update direct callers of btrfs_io_bio_alloc that do error handling, that
we can now remove.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update direct callers of btrfs_bio_alloc that do error handling, that we
can now remove.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Christoph pointed out that bio allocations backed by a bioset will never
fail. As we always use a bioset for all bio allocations, we can skip
the error handling. This patch adjusts our low-level helpers, the
cascaded changes to all callers will come next.
CC: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 5f39d397df ("Btrfs: Create extent_buffer interface
for large blocksizes") refactored btrfs_leaf_data function to take
extent_buffer rather than struct btrfs_leaf. However, as it turns out the
parameter being passed is never used. Furthermore this function no longer
returns the leaf data but rather the offset to it. So rename the function
to BTRFS_LEAF_DATA_OFFSET to make it consistent with other BTRFS_LEAF_*
helpers and turn it into a macro.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ removed () from the macro ]
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass GFP_NOFS to btrfs_bio_clone_partial, so lets hardcode it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.
Also for dio reads which don't get split, we also need to save a copy of
bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access
each bvec in dio read's endio. Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds a new helper btrfs_bio_clone_partial, it'll allocate a cloned
bio that only owns a part of the original bio's data.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For raid1 and raid10, we clone the original bio to the bios which are then
sent to different disks.
Right now we use bio_clone_bioset to create a clone bio with iterating
bi_io_vec to initialize it. This changes it to use bio_clone_fast()
which creates a clone bio but only copies the bi_io_vec pointer
instead of iterating bi_io_vec.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead pass around the failure tree and the io tree.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Once we remove the btree_inode we won't have an inode to pass anymore,
just pass the fs_info directly and the inum since we use that to print
out the repair message.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks. This works fine when everything that has
an extent_io_tree has an inode. But we are going to remove the
btree_inode, so we need to change this. Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead. This had a lot of cascading changes but
should be relatively straightforward.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().
To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().
Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
=m2Gx
-----END PGP SIGNATURE-----
Merge tag 'v4.12-rc5' into for-4.13/block
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.
Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The setting of return code ret should be based on the error code
passed into function end_extent_writepage and not on ret. Thanks
to Liu Bo for spotting this mistake in the original fix I submitted.
Detected by CoverityScan, CID#1414312 ("Logically dead code")
Fixes: 5dca6eea91 ("Btrfs: mark mapping with error flag to report errors to userspace")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
# umount /mnt/btrfs
# mount /dev/vdb5 /mnt/btrfs
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..31]: 25088..25119 32 0x0
1: [32..63]: 25120..25151 32 0x0
2: [64..95]: 25152..25183 32 0x0
3: [96..127]: 25184..25215 32 0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
[REASON]
Btrfs will try to merge extent map when inserting new extent map.
btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=0 len=64k)
| | Found on-disk (ino, EXTENT_DATA, 0)
| |- add_extent_mapping()
| |- Return (em->start=0, len=16k)
|
|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
|
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=16k len=48k)
| | Found on-disk (ino, EXTENT_DATA, 16k)
| |- add_extent_mapping()
| | |- try_merge_map()
| | Merge with previous em start=0 len=16k
| | resulting em start=0 len=32k
| |- Return (em->start=0, len=32K) << Merged result
|- Stripe off the unrelated range (0~16K) of return em
|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
^^^ Causing split fiemap extent.
And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.
[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.
And will always try to merge current fiemap_cache result before calling
fiemap_fill_next_extent().
Only when we failed to merge current fiemap extent with cached one, we
will call fiemap_fill_next_extent() to submit cached one.
So by this method, we can merge all fiemap extents.
It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that scrub can fix data errors with the help of parity for raid56
profile, repair during read is able to as well.
Although the mirror num in raid56 scenario has different meanings, i.e.
0 or 1: read data directly
> 1: do recover with parity,
it could be fit into how we repair bad block during read.
The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the
device and position on it.
Cc: David Sterba <dsterba@suse.cz>
Tested-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have similar code here and there, this merges them into a helper.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 20a7db8ab3 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original sematics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.
This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest. We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ]
Signed-off-by: David Sterba <dsterba@suse.com>
The bug is a regression after commit
(da2c7009f6 "btrfs: teach __process_pages_contig about PAGE_LOCK operation")
and commit
(76c0021db8 "Btrfs: use helper to simplify lock/unlock pages").
So if the dirty pages which are under writeback got truncated partially
before we lock the dirty pages, we couldn't find all pages mapping to the
delalloc range, and the bug didn't return an error so it kept going on and
found that the delalloc range got truncated and got to unlock the dirty
pages, and then the ASSERT could caught the error, and showed
-----------------------------------------------------------------------------
assertion failed: page_ops & PAGE_LOCK, file: fs/btrfs/extent_io.c, line: 1716
-----------------------------------------------------------------------------
This fixes the bug by returning the proper -EAGAIN.
Cc: David Sterba <dsterba@suse.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make extent_io_ops::readpage_io_failed_hook callback mandatory and
define a dummy function for btrfs_extent_io_ops. As the failed IO
callback is not performance critical, the branch vs extra trade off does
not hurt.
Signed-off-by: David Sterba <dsterba@suse.com>
We know that eadpage_end_io_hook, submit_bio_hook and merge_bio_hook are
always defined so we can drop the checks before we call them.
Signed-off-by: David Sterba <dsterba@suse.com>
There's no error path in any of the instances, always return 0.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'tree' was used to call locking hook that does not exist anymore.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The logic has been updated in "Btrfs: make mapping->writeback_index
point to the last written page" (a91326679f) and page is not
needed anymore.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This used to hold number of maximum pages to allocate, but this is now
limited by BIO_MAX_PAGES. The local are now unused and removed as well.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can embed range_changed to the extent changeset to address following
problems:
- no need to allocate ulist dynamically, we also get rid of the GFP_NOFS
for free
- fix lack of allocation failure checking in btrfs_qgroup_reserve_data
The stack consuption where extent_changeset is used slightly increases:
before: 16
after: 16 - 8 (for pointer) + 32 (sizeof ulist) = 40
Which is bearable.
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If btrfs_bio_alloc fails in submit_extent_page, submit_extent_page returns
without clearing the writeback bit of the failed page.
__extent_writepage_io, that is a caller of submit_extent_page,
does not clear the remaining writeback bit anywhere.
As a result, this will cause the hang at filemap_fdatawait_range,
because it waits the writeback bit to be cleared from the failed page.
So, we have to call end_page_writeback to clear the writeback bit.
For reproducing the hang, we inject a fault like
if (should_failtest()) { // I define should_failtest()
bio = NULL;
}
else {
bio = btrfs_bio_alloc(...);
}
in submit_extent_page.
We should also check whether page has the bit before end_page_writeback,
to avoid the conflict against the other end_page_writeback in bio_endio.
Thus, we add PageWriteback checks not only in __extent_writepage_io,
but also in write_one_eb too, because it misses the check.
Signed-off-by: Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have a helper to set page bits, let lock_delalloc_pages and
__unlock_for_delalloc use it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changes to the helper separated from the following patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
This introduces a new helper which can be used to process pages bits.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@cached_state is no more required in __extent_writepage_io, also remove
the goto label.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_ino takes a struct inode and this causes a lot of
internal btrfs functions which consume this ino to take a VFS inode,
rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
of VFS structs into the internals of btrfs first it's necessary to
eliminate all uses of struct inode for the purpose of inode. This patch
does that by using BTRFS_I to convert an inode to btrfs_inode. With
this problem eliminated subsequent patches will start eliminating the
passing of struct inode altogether, eventually resulting in a lot cleaner
code.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: David Sterba <dsterba@suse.com>
try_release_extent_state reduces the gfp mask to GFP_NOFS if it is
compatible. This is true for GFP_KERNEL as well. There is no real
reason to do that though. There is no new lock taken down the
the only consumer of the gfp mask which is
try_release_extent_state
clear_extent_bit
__clear_extent_bit
alloc_extent_state
So this seems just unnecessary and confusing.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
b335b0034e ("Btrfs: Avoid using __GFP_HIGHMEM with slab allocator")
has reduced the allocation mask in btrfs_releasepage to GFP_NOFS just
to prevent from giving an unappropriate gfp mask to the slab allocator
deeper down the callchain (in alloc_extent_state). This is wrong for
two reasons a) GFP_NOFS might be just too restrictive for the calling
context b) it is better to tweak the gfp mask down when it needs that.
So just remove the mask tweaking from btrfs_releasepage and move it
down to alloc_extent_state where it is needed.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from Chris Mason:
"Jeff Mahoney and Dave Sterba have a really nice set of cleanups in
here, and Christoph pitched in corrections/improvements to make btrfs
use proper helpers for bio walking instead of doing it by hand.
There are some key fixes as well, including some long standing bugs
that took forever to track down in btrfs_drop_extents and during
balance"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits)
btrfs: limit async_work allocation and worker func duration
Revert "Btrfs: adjust len of writes if following a preallocated extent"
Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors
btrfs: opencode chunk locking, remove helpers
btrfs: remove root parameter from transaction commit/end routines
btrfs: split btrfs_wait_marked_extents into normal and tree log functions
btrfs: take an fs_info directly when the root is not used otherwise
btrfs: simplify btrfs_wait_cache_io prototype
btrfs: convert extent-tree tracepoints to use fs_info
btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
btrfs: root->fs_info cleanup, add fs_info convenience variables
btrfs: root->fs_info cleanup, update_block_group{,flags}
btrfs: root->fs_info cleanup, lock/unlock_chunks
btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
btrfs: pull node/sector/stripe sizes out of root and into fs_info
btrfs: root->fs_info cleanup, io_ctl_init
btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
btrfs: struct reada_control.root -> reada_control.fs_info
btrfs: struct btrfsic_state->root should be an fs_info
btrfs: alloc_reserved_file_extent trace point should use extent_root
...
Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.
The major parts of this pull request is:
- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.
- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.
- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.
- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.
- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.
- Multi-connection support for nbd. From Josef.
- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.
- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.
- Support for WRITE_ZEROES from Chaitanya.
- Lightnvm updates from Javier/Matias.
- Supoort for FC for the nvme-over-fabrics code. From James Smart.
- A bunch of fixes from a whole slew of people, too many to name
here"
* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...
Now we only use the root parameter to print the root objectid in
a tracepoint. We can use the root parameter from the transaction
handle for that. It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable. This makes the code considerably
more readable.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We track the node sizes per-root, but they never vary from the values
in the superblock. This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using copy_extent_buffer is suitable for copying betwenn buffers from an
arbitrary offset and deals with page boundaries. This is not necessary
when doing a full extent_buffer-to-extent_buffer copy. We can utilize
the copy_page helper as well.
Signed-off-by: David Sterba <dsterba@suse.com>
The only memset we do is to 0, so sink the parameter to the function and
simplify all calls. Rename the function to reflect the behaviour.
Signed-off-by: David Sterba <dsterba@suse.com>
The fsid and chunk tree uuid are always located in the first page,
we don't need the to use write_extent_buffer.
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations. But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags. This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We cast 0 to a u8 but then because of type promotion, it's immediately
cast to int back to int before we do a bitwise negate. The cast doesn't
matter in this case, the code works as intended. It causes a static
checker warning though so let's remove it.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In convert_free_space_to_{bitmaps,extents}(), we buffer the free space
bitmaps in memory and copy them directly to/from the extent buffers with
{read,write}_extent_buffer(). The extent buffer bitmap helpers use byte
granularity, which is equivalent to a little-endian bitmap. This means
that on big-endian systems, the in-memory bitmaps will be written to
disk byte-swapped. To fix this, use byte-granularity for the bitmaps in
memory.
Fixes: a5ed918285 ("Btrfs: implement the free space B-tree")
Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is an additional patch to
"Btrfs: memset to avoid stale content in btree node block".
This uses memset to initialize the unused space in a leaf to avoid
potential stale content, which may be incurred by pushing items
between sibling leaves.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For many printks, we want to know which file system issued the message.
This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.
fs/btrfs/check-integrity.c is left alone for another day.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch converts printk(KERN_* style messages to use the pr_* versions.
One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."
This patch unsplits user-visible strings.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During updating btree, we could push items between sibling
nodes/leaves, for leaves data sections starts reversely from
the end of the block while for nodes we only have key pairs
which are stored one by one from the start of the block.
So we could do try to push key pairs from one node to the next
node right in the tree, and after that, we update the node's
nritems to reflect the correct end while leaving the stale
content in the node. One may intentionally corrupt the fs
image and access the stale content by bumping the nritems and
causes various crashes.
This takes the in-memory @nritems as the correct one and
gets to memset the unused part of a btree node.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nobody uses this, it makes no sense to do partial reads of extent buffers.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a lot of random ints in btrfs_fs_info that can be put into flags. This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc()
parameters for both in-band dedupe and subpage sector size patchset.
This should reduce conflict of both patchset and the effort to rebase
them.
Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
So we can read a btree block via readahead or intentional read,
and we can end up with a memory leak when something happens as
follows,
1) readahead starts to read block A but does not wait for read
completion,
2) btree_readpage_end_io_hook finds that block A is corrupted,
and it needs to clear all block A's pages' uptodate bit.
3) meanwhile an intentional read kicks in and checks block A's
pages' uptodate to decide which page needs to be read.
4) when some pages have the uptodate bit during 3)'s check so
3) doesn't count them for eb->io_pages, but they are later
cleared by 2) so we has to readpage on the page, we get
the wrong eb->io_pages which results in a memory leak of
this block.
This fixes the problem by firstly getting all pages's locking and
then checking pages' uptodate bit.
t1(readahead) t2(readahead endio) t3(the following read)
read_extent_buffer_pages end_bio_extent_readpage
for pg in eb: for page 0,1,2 in eb:
if pg is uptodate: btree_readpage_end_io_hook(pg)
num_reads++ if uptodate:
eb->io_pages = num_reads SetPageUptodate(pg) _______________
for pg in eb: for page 3 in eb: read_extent_buffer_pages
if pg is NOT uptodate: btree_readpage_end_io_hook(pg) for pg in eb:
__extent_read_full_page(pg) sanity check reports something wrong if pg is uptodate:
clear_extent_buffer_uptodate(eb) num_reads++
for pg in eb: eb->io_pages = num_reads
ClearPageUptodate(page) _______________
for pg in eb:
if pg is NOT uptodate:
__extent_read_full_page(pg)
So t3's eb->io_pages is not consistent with the number of pages it's reading,
and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative
number so that we're not able to free the eb.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only in the case of different root_id or different object_id, check_shared
identified extent as the shared. However, If a extent was referred by
different offset of same file, it should also be identified as shared.
In addition, check_shared's loop scale is at least n^3, so if a extent
has too many references, even causes soft hang up.
First, add all delayed_ref to the ref_tree and calculate the unqiue_refs,
if the unique_refs is greater than one, return BACKREF_FOUND_SHARED.
Then individually add the on-disk reference(inline/keyed) to the ref_tree
and calculate the unique_refs of the ref_tree to check if the unique_refs
is greater than one.Because once there are two references to return
SHARED, so the time complexity is close to the constant.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull block fixes from Jens Axboe:
"Here's the second round of block updates for this merge window.
It's a mix of fixes for changes that went in previously in this round,
and fixes in general. This pull request contains:
- Fixes for loop from Christoph
- A bdi vs gendisk lifetime fix from Dan, worth two cookies.
- A blk-mq timeout fix, when on frozen queues. From Gabriel.
- Writeback fix from Jan, ensuring that __writeback_single_inode()
does the right thing.
- Fix for bio->bi_rw usage in f2fs from me.
- Error path deadlock fix in blk-mq sysfs registration from me.
- Floppy O_ACCMODE fix from Jiri.
- Fix to the new bio op methods from Mike.
One more followup will be coming here, ensuring that we don't
propagate the block types outside of block. That, and a rename of
bio->bi_rw is coming right after -rc1 is cut.
- Various little fixes"
* 'for-linus' of git://git.kernel.dk/linux-block:
mm/block: convert rw_page users to bio op use
loop: make do_req_filebacked more robust
loop: don't try to use AIO for discards
blk-mq: fix deadlock in blk_mq_register_disk() error path
Include: blkdev: Removed duplicate 'struct request;' declaration.
Fixup direct bi_rw modifiers
block: fix bdi vs gendisk lifetime mismatch
blk-mq: Allow timeouts to run while queue is freezing
nbd: fix race in ioctl
block: fix use-after-free in seq file
f2fs: drop bio->bi_rw manual assignment
block: add missing group association in bio-cloning functions
blkcg: kill unused field nr_undestroyed_grps
writeback: Write dirty times for WB_SYNC_ALL writeback
floppy: fix open(O_ACCMODE) for ioctl-only open
Pull more btrfs updates from Chris Mason:
"This is part two of my btrfs pull, which is some cleanups and a batch
of fixes.
Most of the code here is from Jeff Mahoney, making the pointers we
pass around internally more consistent and less confusing overall. I
noticed a small problem right before I sent this out yesterday, so I
fixed it up and re-tested overnight"
* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits)
Btrfs: fix __MAX_CSUM_ITEMS
btrfs: btrfs_abort_transaction, drop root parameter
btrfs: add btrfs_trans_handle->fs_info pointer
btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
btrfs: convert nodesize macros to static inlines
btrfs: introduce BTRFS_MAX_ITEM_SIZE
btrfs: cleanup, remove prototype for btrfs_find_root_ref
btrfs: copy_to_sk drop unused root parameter
btrfs: simpilify btrfs_subvol_inherit_props
btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
btrfs: tests, require fs_info for root
btrfs: tests, move initialization into tests/
btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
btrfs: prefix fsid to all trace events
btrfs: plumb fs_info into btrfs_work
btrfs: remove obsolete part of comment in statfs
btrfs: hide test-only member under ifdef
btrfs: Ratelimit "no csum found" info message
btrfs: Add ratelimit to btrfs printing
Btrfs: fix unexpected balance crash due to BUG_ON
...
bi_rw should be using bio_set_op_attrs to set bi_rw.
Signed-off-by: Shaun Tancheff <shaun@tancheff.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When a bio is cloned, the newly created bio must be associated with
the same blkcg as the original bio (if BLK_CGROUP is enabled). If
this operation is not performed, then the new bio is not associated
with any group, and the group of the current task is returned when
the group of the bio is requested.
Depending on the cloning frequency, this may cause a large
percentage of the bios belonging to a given group to be treated
as if belonging to other groups (in most cases as if belonging to
the root group). The expected group isolation may thereby be broken.
This commit adds the missing association in bio-cloning functions.
Fixes: da2f0f74cf ("Btrfs: add support for blkio controllers")
Cc: stable@vger.kernel.org # v4.3+
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2
- most(?) of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
thp: fix comments of __pmd_trans_huge_lock()
cgroup: remove unnecessary 0 check from css_from_id()
cgroup: fix idr leak for the first cgroup root
mm: memcontrol: fix documentation for compound parameter
mm: memcontrol: remove BUG_ON in uncharge_list
mm: fix build warnings in <linux/compaction.h>
mm, thp: convert from optimistic swapin collapsing to conservative
mm, thp: fix comment inconsistency for swapin readahead functions
thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
shmem: split huge pages beyond i_size under memory pressure
thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
khugepaged: add support of collapse for tmpfs/shmem pages
shmem: make shmem_inode_info::lock irq-safe
khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
thp: extract khugepaged from mm/huge_memory.c
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
shmem: add huge pages support
shmem: get_unmapped_area align huge page
shmem: prepare huge= mount option and sysfs knob
mm, rmap: account shmem thp pages
...
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs. This gfp mask discrepancy is really unfortunate and easily
fixable. Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.
This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well. The same applies to read_cache_pages.
ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core block updates from Jens Axboe:
- the big change is the cleanup from Mike Christie, cleaning up our
uses of command types and modified flags. This is what will throw
some merge conflicts
- regression fix for the above for btrfs, from Vincent
- following up to the above, better packing of struct request from
Christoph
- a 2038 fix for blktrace from Arnd
- a few trivial/spelling fixes from Bart Van Assche
- a front merge check fix from Damien, which could cause issues on
SMR drives
- Atari partition fix from Gabriel
- convert cfq to highres timers, since jiffies isn't granular enough
for some devices these days. From Jan and Jeff
- CFQ priority boost fix idle classes, from me
- cleanup series from Ming, improving our bio/bvec iteration
- a direct issue fix for blk-mq from Omar
- fix for plug merging not involving the IO scheduler, like we do for
other types of merges. From Tahsin
- expose DAX type internally and through sysfs. From Toshi and Yigal
* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
block: Fix front merge check
block: do not merge requests without consulting with io scheduler
block: Fix spelling in a source code comment
block: expose QUEUE_FLAG_DAX in sysfs
block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Btrfs: fix comparison in __btrfs_map_block()
block: atari: Return early for unsupported sector size
Doc: block: Fix a typo in queue-sysfs.txt
cfq-iosched: Charge at least 1 jiffie instead of 1 ns
cfq-iosched: Fix regression in bonnie++ rewrite performance
cfq-iosched: Convert slice_resid from u64 to s64
block: Convert fifo_time from ulong to u64
blktrace: avoid using timespec
block/blk-cgroup.c: Declare local symbols static
block/bio-integrity.c: Add #include "blk.h"
block/partition-generic.c: Remove a set-but-not-used variable
block: bio: kill BIO_MAX_SIZE
cfq-iosched: temporarily boost queue priority for idle classes
block: drbd: avoid to use BIO_MAX_SIZE
block: bio: remove BIO_MAX_SECTORS
...
eb->io_pages is set in read_extent_buffer_pages().
In case of readpage failure, for pages that have been added to bio,
it calls bio_endio and later readpage_io_failed_hook() does the work.
When this eb's page (couldn't be the 1st page) fails to add itself to bio
due to failure in merge_bio(), it cannot decrease eb->io_pages via bio_endio,
and ends up with a memory leak eventually.
This lets __do_readpage propagate errors to callers and adds the
'atomic_dec(&eb->io_pages)'.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One can use btrfs-corrupt-block to hit BUG_ON() in merge_bio(),
thus this aims to stop anyone to panic the whole system by using
their btrfs.
Since the error in merge_bio can only come from __btrfs_map_block()
when chunk tree mapping has something insane and __btrfs_map_block()
has already had printed the reason, we can just return errors in
merge_bio.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs to be shrunk. This
means those caches are not in fact reclaimable.
To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
map_private_extent_buffer() can return -EINVAL in two different cases,
1. when the requested contents span two pages if nodesize is larger
than pagesize,
2. when it detects something insane.
The 2nd one used to be only a WARN_ON(1), and we decided to return a error
to callers, but we didn't fix up all its callers, which will be
addressed by this patch.
Without this, btrfs may end up with 'general protection', ie.
reading invalid memory.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer(). An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.
This adds a warning to let us quickly know what was happening.
Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch has btrfs's submit_one_bio users set the bio op using
bio_set_op_attrs and get the op using bio_op.
The next patches will continue to convert btrfs,
so submit_bio_hook and merge_bio_hook
related code will be modified to take only the bio. I did
not do it in this patch to try and keep it smaller.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
self-tests code assumes 4k as the sectorsize and nodesize. This commit
fix hardcoded 4K. Enables the self-tests code to be executed on non-4k
page sized systems (e.g. ppc64).
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While we are finishing a device replace operation we can have a concurrent
task trying to do a read repair operation, in which case it will call
btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
points to the source device of the device replace operation. This allows
for the read repair task to dereference the stripe's device pointer after
the device replace operation has freed the source device, resulting in
an invalid memory access. This is similar to the problem solved by my
previous patch in the same series and named "Btrfs: fix race between
device replace and discard".
So fix this by surrounding the call to btrfs_map_block() and the code
that uses the returned struct btrfs_bio with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
proper serialization with the finishing phase of the device replace
operation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
btrfs's fiemap is supposed to return 0 on success and return < 0 on
error. however, ret becomes 1 after looking up the last file extent:
btrfs_lookup_file_extent ->
btrfs_search_slot(..., ins_len=0, cow=0)
and if the offset is beyond EOF, we'll get 'path' pointed to the place
of potentail insertion, and ret == 1.
This may confuse applications using ioctl(FIEL_IOC_FIEMAP).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It seems to be long time unused, since 2008 and
6885f308b5 ("Btrfs: Misc 2.6.25 updates").
Propagating the removal touches some code but has no functional effect.
Signed-off-by: David Sterba <dsterba@suse.com>
Single caller passes GFP_NOFS. We can get rid of the
gfpflags_allow_blocking checks as NOFS can block but does not recurse to
filesystem through reclaim.
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to __clear_extent_bit, do not fail if the state preallocation
fails as we might not need it. One less BUG_ON.
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we bail out immediately if ->writepage() returns an error,
we don't need an extra error to retain the error code.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If sequential writer is writing in the middle of the page and it just redirties
the last written page by continuing from it.
In the above case this can end up with seeking back to that firstly redirtied
page after writing all the pages at the end of file because btrfs updates
mapping->writeback_index to 1 past the current one.
For non-cow filesystems, the cost is only about extra seek, while for cow
filesystems such as btrfs, it means unnecessary fragments.
To avoid it, we just need to continue writeback from the last written page.
This also updates btrfs to behave like what write_cache_pages() does, ie, bail
out immediately if there is an error in writepage().
<Ref: https://www.spinics.net/lists/linux-btrfs/msg52628.html>
Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mostly direct substitution with occasional adjustment or removing
outdated comments.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides
to partially inline the get_state_failrec() function but cannot
figure out that means the failrec pointer is always valid
if the function returns success, which causes a harmless
warning:
fs/btrfs/extent_io.c: In function 'clean_io_failure':
fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This marks get_state_failrec() and set_state_failrec() both
as 'noinline', which avoids the warning in all cases for me,
and seems less ugly than adding a fake initialization.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 47dc196ae7 ("btrfs: use proper type for failrec in extent_state")
Signed-off-by: David Sterba <dsterba@suse.com>
Cleanup.
kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Not needed after the previous patch named
"Btrfs: fix page reading in extent_same ioctl leading to csum errors".
Signed-off-by: Filipe Manana <fdmanana@suse.com>
In subpagesize-blocksize scenario it is not sufficient to search using the
first byte of the page to make sure that there are no ordered extents
present across the page. Fix this.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.
So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sanity test the extent buffer bitmap operations (test, set, and clear)
against the equivalent standard kernel operations.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.
Signed-off-by: David Sterba <dsterba@suse.com>
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.
Signed-off-by: David Sterba <dsterba@suse.com>
Does not return any errors, nor anything from the callgraph. There's a
BUG_ON but it's a sanity check and not an error condition we could
recover from.
Signed-off-by: David Sterba <dsterba@suse.com>
Does not return any errors, nor anything from the callgraph. The branch
in end_bio_extent_writepage has been skipped since
5fd0204355 ("Btrfs: finish ordered extents in their own thread").
Signed-off-by: David Sterba <dsterba@suse.com>
The funcions just wrap the clear_extent_bit API and generate function
calls. This increases stack consumption and may negatively affect
performance due to icache misses. We can simply make the helpers static
inline and keep the type checking and API untouched. The code slightly
decreases:
text data bss dec hex filename
938667 43670 23144 1005481 f57a9 fs/btrfs/btrfs.ko.before
939651 43670 23144 1006465 f5b81 fs/btrfs/btrfs.ko.after
Signed-off-by: David Sterba <dsterba@suse.com>
The funcions just wrap the set_extent_bit API and generate function
calls. This increases stack consumption and may negatively affect
performance due to icache misses. We can simply make the helpers static
inline and keep the type checking and API untouched. The code slightly
increases:
text data bss dec hex filename
938427 43670 23144 1005241 f56b9 fs/btrfs/btrfs.ko.before
938667 43670 23144 1005481 f57a9 fs/btrfs/btrfs.ko
Signed-off-by: David Sterba <dsterba@suse.com>
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce new function clear_record_extent_bits(), which will clear bits
for given range and record the details about which ranges are cleared
and how many bytes in total it changes.
This provides the basis for later qgroup reserve codes.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Introduce new function set_record_extent_bits(), which will not only set
given bits, but also record how many bytes are changed, and detailed
range info.
This is quite important for later qgroup reserve framework.
The number of bytes will be used to do qgroup reserve, and detailed
range info will be used to cleanup for EQUOT case.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
If when reading a page we find a hole and our caller had already locked
the range (bio flags has the bit EXTENT_BIO_PARENT_LOCKED set), we end
up unlocking the hole's range and then later our caller unlocks it
again, which might have already been locked by some other task once
the first unlock happened.
Currently this can only happen during a call to the extent_same ioctl,
as it's the only caller of __do_readpage() that sets the bit
EXTENT_BIO_PARENT_LOCKED for bio flags.
Fix this by leaving the unlock exclusively to the caller.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Pull btrfs fixes from Chris Mason:
"These are small and assorted. Neil's is the oldest, I dropped the
ball thinking he was going to send it in"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: support NFSv2 export
Btrfs: open_ctree: Fix possible memory leak
Btrfs: fix deadlock when finalizing block group creation
Btrfs: update fix for read corruption of compressed and shared extents
Btrfs: send, fix corner case for reference overwrite detection
Convert the simple cases, not all functions provide a way to reach the
fs_info. Also skipped debugging messages (print-tree, integrity
checker and pr_debug) and messages that are printed from possibly
unfinished mount.
Signed-off-by: David Sterba <dsterba@suse.com>
My previous fix in commit 005efedf2c ("Btrfs: fix read corruption of
compressed and shared extents") was effective only if the compressed
extents cover a file range with a length that is not a multiple of 16
pages. That's because the detection of when we reached a different range
of the file that shares the same compressed extent as the previously
processed range was done at extent_io.c:__do_contiguous_readpages(),
which covers subranges with a length up to 16 pages, because
extent_readpages() groups the pages in clusters no larger than 16 pages.
So fix this by tracking the start of the previously processed file
range's extent map at extent_readpages().
The following test case for fstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create our test file with a single extent of 64Kb that is going to
# be compressed no matter which compression algo is used (zlib/lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone the compressed extent into an adjacent file offset.
$CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
echo "File digest before unmount:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
# Remount the fs or clear the page cache to trigger the bug in
# btrfs. Because the extent has an uncompressed length that is a
# multiple of 16 pages, all the pages belonging to the second range
# of the file (64K to 128K), which points to the same extent as the
# first range (0K to 64K), had their contents full of zeroes instead
# of the byte 0xaa. This was a bug exclusively in the read path of
# compressed extents, the correct data was stored on disk, btrfs
# just failed to fill in the pages correctly.
_scratch_remount
echo "File digest after remount:"
# Must match the digest we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Timofey Titovets <nefelim4ag@gmail.com>
Pull btrfs fixes from Chris Mason:
"This is an assorted set I've been queuing up:
Jeff Mahoney tracked down a tricky one where we ended up starting IO
on the wrong mapping for special files in btrfs_evict_inode. A few
people reported this one on the list.
Filipe found (and provided a test for) a difficult bug in reading
compressed extents, and Josef fixed up some quota record keeping with
snapshot deletion. Chandan killed off an accounting bug during DIO
that lead to WARN_ONs as we freed inodes"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: keep dropped roots in cache until transaction commit
Btrfs: Direct I/O: Fix space accounting
btrfs: skip waiting on ordered range for special files
Btrfs: fix read corruption of compressed and shared extents
Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock
Btrfs: don't initialize a space info as full to prevent ENOSPC
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Pull btrfs updates from Chris Mason:
"This has Jeff Mahoney's long standing trim patch that fixes corners
where trims were missing. Omar has some raid5/6 fixes, especially for
using scrub and device replace when devices are missing.
Zhao Lie continues cleaning and fixing things, this series fixes some
really hard to hit corners in xfstests. I had to pull it last merge
window due to some deadlocks, but those are now resolved.
I added support for Tejun's new blkio controllers. It seems to work
well for single devices, we'll expand to multi-device as well"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits)
btrfs: fix compile when block cgroups are not enabled
Btrfs: fix file read corruption after extent cloning and fsync
Btrfs: check if previous transaction aborted to avoid fs corruption
btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
btrfs: Prevent from early transaction abort
btrfs: Remove unused arguments in tree-log.c
btrfs: Remove useless condition in start_log_trans()
Btrfs: add support for blkio controllers
Btrfs: remove unused mutex from struct 'btrfs_fs_info'
Btrfs: fix parity scrub of RAID 5/6 with missing device
Btrfs: fix device replace of a missing RAID 5/6 device
Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
Btrfs: count devices correctly in readahead during RAID 5/6 replace
Btrfs: remove misleading handling of missing device scrub
btrfs: fix clone / extent-same deadlocks
Btrfs: fix defrag to merge tail file extent
Btrfs: fix warning in backref walking
btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
btrfs: Remove root argument in extent_data_ref_count()
btrfs: Fix wrong comment of btrfs_alloc_tree_block()
...
bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
This adds an ifdef around them. It's not perfect, but our
use of bi_ioc is being removed in the 4.3 merge window.
The bi_css usage really should go into bio_clone, but I want to make
sure that doesn't introduce problems for other bio_clone use cases.
Signed-off-by: Chris Mason <clm@fb.com>
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This attaches accounting information to bios as we submit them so the
new blkio controllers can throttle on btrfs filesystems.
Not much is required, we're just associating bios with blkcgs during clone,
calling wbc_init_bio()/wbc_account_io() during writepages submission,
and attaching the bios to the current context during direct IO.
Finally if we are splitting bios during btrfs_map_bio, this attaches
accounting information to the split.
The end result is able to throttle nicely on single disk filesystems. A
little more work is required for multi-device filesystems.
Signed-off-by: Chris Mason <clm@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull btrfs updates from Chris Mason:
"Outside of our usual batch of fixes, this integrates the subvolume
quota updates that Qu Wenruo from Fujitsu has been working on for a
few releases now. He gets an extra gold star for making btrfs smaller
this time, and fixing a number of quota corners in the process.
Dave Sterba tested and integrated Anand Jain's sysfs improvements.
Outside of exporting a symbol (ack'd by Greg) these are all internal
to btrfs and it's mostly cleanups and fixes. Anand also attached some
of our sysfs objects to our internal device management structs instead
of an object off the super block. It will make device management
easier overall and it's a better fit for how the sysfs files are used.
None of the existing sysfs files are moved around.
Thanks for all the fixes everyone"
* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (87 commits)
btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref()
Btrfs: Check if kobject is initialized before put
lib: export symbol kobject_move()
Btrfs: sysfs: add support to show replacing target in the sysfs
Btrfs: free the stale device
Btrfs: use received_uuid of parent during send
Btrfs: fix use-after-free in btrfs_replay_log
btrfs: wait for delayed iputs on no space
btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup.
btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.
btrfs: ulist: Add ulist_del() function.
btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.
btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.
btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.
btrfs: qgroup: Switch rescan to new mechanism.
btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents().
btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots().
btrfs: qgroup: Add new function to record old_roots.
btrfs: qgroup: Record possible quota-related extent for qgroup.
btrfs: qgroup: Add function qgroup_update_counters().
...
Pull core block IO update from Jens Axboe:
"Nothing really major in here, mostly a collection of smaller
optimizations and cleanups, mixed with various fixes. In more detail,
this contains:
- Addition of policy specific data to blkcg for block cgroups. From
Arianna Avanzini.
- Various cleanups around command types from Christoph.
- Cleanup of the suspend block I/O path from Christoph.
- Plugging updates from Shaohua and Jeff Moyer, for blk-mq.
- Eliminating atomic inc/dec of both remaining IO count and reference
count in a bio. From me.
- Fixes for SG gap and chunk size support for data-less (discards)
IO, so we can merge these better. From me.
- Small restructuring of blk-mq shared tag support, freeing drivers
from iterating hardware queues. From Keith Busch.
- A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
IOPS mode the default for non-rotational storage"
* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
cfq-iosched: move group scheduling functions under ifdef
cfq-iosched: fix the setting of IOPS mode on SSDs
blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
block, cgroup: implement policy-specific per-blkcg data
block: Make CFQ default to IOPS mode on SSDs
block: add blk_set_queue_dying() to blkdev.h
blk-mq: Shared tag enhancements
block: don't honor chunk sizes for data-less IO
block: only honor SG gap prevention for merges that contain data
block: fix returnvar.cocci warnings
block, dm: don't copy bios for request clones
block: remove management of bi_remaining when restoring original bi_end_io
block: replace trylock with mutex_lock in blkdev_reread_part()
block: export blkdev_reread_part() and __blkdev_reread_part()
suspend: simplify block I/O handling
block: collapse bio bit space
block: remove unused BIO_RW_BLOCK and BIO_EOF flags
block: remove BIO_EOPNOTSUPP
...
When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits()
through free_io_failure(), we weren't waking up any tasks waiting for the
extent's state EXTENT_LOCKED bit, leading to an hang.
So make sure clear_extent_bits() ends up waking up any waiters if the
bit EXTENT_LOCKED is supplied by its callers.
Zygo Blaxell was experiencing such hangs at inode eviction time after
file unlinks. Thanks to him for a set of scripts to reproduce the issue.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
There's a race between releasing extent buffers that are flagged as stale
and recycling them that makes us it the following BUG_ON at
btrfs_release_extent_buffer_page:
BUG_ON(extent_buffer_under_io(eb))
The BUG_ON is triggered because the extent buffer has the flag
EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
dirty by another concurrent task.
Here follows a sequence of steps that leads to the BUG_ON.
CPU 0 CPU 1 CPU 2
path->nodes[0] == eb X
X->refs == 2 (1 for the tree, 1 for the path)
btrfs_header_generation(X) == current trans id
flag EXTENT_BUFFER_DIRTY set on X
btrfs_release_path(path)
unlocks X
reads eb X
X->refs incremented to 3
locks eb X
btrfs_del_items(X)
X becomes empty
clean_tree_block(X)
clear EXTENT_BUFFER_DIRTY from X
btrfs_del_leaf(X)
unlocks X
extent_buffer_get(X)
X->refs incremented to 4
btrfs_free_tree_block(X)
X's range is not pinned
X's range added to free
space cache
free_extent_buffer_stale(X)
lock X->refs_lock
set EXTENT_BUFFER_STALE on X
release_extent_buffer(X)
X->refs decremented to 3
unlocks X->refs_lock
btrfs_release_path()
unlocks X
free_extent_buffer(X)
X->refs becomes 2
__btrfs_cow_block(Y)
btrfs_alloc_tree_block()
btrfs_reserve_extent()
find_free_extent()
gets offset == X->start
btrfs_init_new_buffer(X->start)
btrfs_find_create_tree_block(X->start)
alloc_extent_buffer(X->start)
find_extent_buffer(X->start)
finds eb X in radix tree
free_extent_buffer(X)
lock X->refs_lock
test X->refs == 2
test bit EXTENT_BUFFER_STALE is set
test !extent_buffer_under_io(eb)
increments X->refs to 3
mark_extent_buffer_accessed(X)
check_buffer_tree_ref(X)
--> does nothing,
X->refs >= 2 and
EXTENT_BUFFER_TREE_REF
is set in X
clear EXTENT_BUFFER_STALE from X
locks X
btrfs_mark_buffer_dirty()
set_extent_buffer_dirty(X)
check_buffer_tree_ref(X)
--> does nothing, X->refs >= 2 and
EXTENT_BUFFER_TREE_REF is set
sets EXTENT_BUFFER_DIRTY on X
test and clear EXTENT_BUFFER_TREE_REF
decrements X->refs to 2
release_extent_buffer(X)
decrements X->refs to 1
unlock X->refs_lock
unlock X
free_extent_buffer(X)
lock X->refs_lock
release_extent_buffer(X)
decrements X->refs to 0
btrfs_release_extent_buffer_page(X)
BUG_ON(extent_buffer_under_io(X))
--> EXTENT_BUFFER_DIRTY set on X
Fix this by making find_extent buffer wait for any ongoing task currently
executing free_extent_buffer()/free_extent_buffer_stale() if the extent
buffer has the stale flag set.
A more clean alternative would be to always increment the extent buffer's
reference count while holding its refs_lock spinlock but find_extent_buffer
is a performance critical area and that would cause lock contention whenever
multiple tasks search for the same extent buffer concurrently.
A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
btrfs patches backported from newer kernels) was hitting this often:
[1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
(...)
[1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
[1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
[1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
[1212302.540792] RIP: 0010:[<ffffffffa0507081>] [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
(...)
[1212302.630008] Call Trace:
[1212302.630008] [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
[1212302.630008] [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
[1212302.630008] [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
[1212302.630008] [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
[1212302.630008] [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
[1212302.630008] [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
[1212302.630008] [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
[1212302.630008] [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
[1212302.630008] [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
[1212302.630008] [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
[1212302.630008] [<ffffffff811b9e28>] evict+0xa8/0x190
[1212302.630008] [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0
I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
integration branch from about a month ago, running the following stress
test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):
while true; do
mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
mount /dev/sdd /mnt
snapshot_cmd="btrfs subvolume snapshot -r /mnt"
snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
umount /mnt
done
Which usually triggers the BUG_ON within less than 24 hours:
[49558.618097] ------------[ cut here ]------------
[49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
(...)
[49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G W 3.19.0-btrfs-next-7+ #3
[49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
[49558.620031] RIP: 0010:[<ffffffffa0476b1a>] [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
(...)
[49558.620031] Call Trace:
[49558.620031] [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
[49558.620031] [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
[49558.620031] [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
[49558.620031] [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
[49558.620031] [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
[49558.620031] [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
[49558.620031] [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
[49558.728054] [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[49558.728054] [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
[49558.728054] [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
[49558.728054] [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
[49558.728054] [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
[49558.728054] [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
[49558.728054] [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
[49558.728054] [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
[49558.728054] [<ffffffff81155939>] iterate_supers+0x76/0xc4
[49558.728054] [<ffffffff811798e8>] sys_sync+0x55/0x83
[49558.728054] [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_release_extent_buffer_page() can't handle dummy extent that
allocated by btrfs_clone_extent_buffer() properly. That is because
reference count of pages that allocated by btrfs_clone_extent_buffer()
was 2, 1 by alloc_page(), and another by attach_extent_buffer_page().
Running following command repeatly can check this memory leak problem
btrfs inspect-internal inode-resolve 256 /mnt/btrfs
Signed-off-by: Chien-Kuan Yeh <ckya@synology.com>
Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Consider the following interleaving of overlapping calls to
alloc_extent_buffer:
Call 1:
- Successfully allocates a few pages with find_or_create_page
- find_or_create_page fails, goto free_eb
- Unlocks the allocated pages
Call 2:
- Calls find_or_create_page and gets a page in call 1's extent_buffer
- Finds that the page is already associated with an extent_buffer
- Grabs a reference to the half-written extent_buffer and calls
mark_extent_buffer_accessed on it
mark_extent_buffer_accessed will then try to call mark_page_accessed on
a null page and panic.
The fix is to decrement the reference count on the half-written
extent_buffer before unlocking the pages so call 2 won't use it. We
should also set exists = NULL in the case that we don't use exists to
avoid accidentally returning a freed extent_buffer in an error case.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was
the last extent that will fit in user array. If 1 is returned, the return
value may eventually returned to user space, which should not happen, according
to manpage of ioctl.
Signed-off-by: Chengyu Song <csong84@gatech.edu>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"Most of these are fixing extent reservation accounting, or corners
with tree writeback during commit.
Josef's set does add a test, which isn't strictly a fix, but it'll
keep us from making this same mistake again"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix outstanding_extents accounting in DIO
Btrfs: add sanity test for outstanding_extents accounting
Btrfs: just free dummy extent buffers
Btrfs: account merges/splits properly
Btrfs: prepare block group cache before writing
Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
Btrfs: account for the correct number of extents for delalloc reservations
Btrfs: fix merge delalloc logic
Btrfs: fix comp_oper to get right order
Btrfs: catch transaction abortion after waiting for it
btrfs: fix sizeof format specifier in btrfs_check_super_valid()
If we fail during our sanity tests we could get NULL deref's because we unload
the module before the dummy extent buffers are free'd via RCU. So check for
this case and just free the things directly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Pull btrfs updates from Chris Mason:
"This pull is mostly cleanups and fixes:
- The raid5/6 cleanups from Zhao Lei fixup some long standing warts
in the code and add improvements on top of the scrubbing support
from 3.19.
- Josef has round one of our ENOSPC fixes coming from large btrfs
clusters here at FB.
- Dave Sterba continues a long series of cleanups (thanks Dave), and
Filipe continues hammering on corner cases in fsync and others
This all was held up a little trying to track down a use-after-free in
btrfs raid5/6. It's not clear yet if this is just made easier to
trigger with this pull or if its a new bug from the raid5/6 cleanups.
Dave Sterba is the only one to trigger it so far, but he has a
consistent way to reproduce, so we'll get it nailed shortly"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
Btrfs: don't remove extents and xattrs when logging new names
Btrfs: fix fsync data loss after adding hard link to inode
Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
Btrfs: account for large extents with enospc
Btrfs: don't set and clear delalloc for O_DIRECT writes
Btrfs: only adjust outstanding_extents when we do a short write
btrfs: Fix out-of-space bug
Btrfs: scrub, fix sleep in atomic context
Btrfs: fix scheduler warning when syncing log
Btrfs: Remove unnecessary placeholder in btrfs_err_code
btrfs: cleanup init for list in free-space-cache
btrfs: delete chunk allocation attemp when setting block group ro
btrfs: clear bio reference after submit_one_bio()
Btrfs: fix scrub race leading to use-after-free
Btrfs: add missing cleanup on sysfs init failure
Btrfs: fix race between transaction commit and empty block group removal
btrfs: add more checks to btrfs_read_sys_array
btrfs: cleanup, rename a few variables in btrfs_read_sys_array
btrfs: add checks for sys_chunk_array sizes
btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
...
On our gluster boxes we stream large tar balls of backups onto our fses. With
160gb of ram this means we get really large contiguous ranges of dirty data, but
the way our ENOSPC stuff works is that as long as it's contiguous we only hold
metadata reservation for one extent. The problem is we limit our extents to
128mb, so we'll end up with at least 800 extents so our enospc accounting is
quite a bit lower than what we need. To keep track of this make sure we
increase outstanding_extents for every multiple of the max extent size so we can
be sure to have enough reserved metadata space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Helper account_page_redirty() fixes dirty pages counter for redirtied
pages. This patch puts it after dirtying and prevents temporary
underflows of dirtied pages counters on zone/bdi and current->nr_dirtied.
Signed-off-by: Konstantin Khebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
forced data type checking in compile.
Changelog v1->v2:
Rename following by David Sterba's suggestion:
put_btrfs_bio() -> btrfs_put_bio()
get_btrfs_bio() -> btrfs_get_bio()
bbio->ref_count -> bbio->refs
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently there's a 4B hole in the structure between refs and state and there
are only 16 bits used so we can make it unsigned. This will get a better
packing and may save some stack space for local variables.
The size of extent_state gets reduced by 8B and there are usually a lot
of slab objects.
struct extent_state {
u64 start; /* 0 8 */
u64 end; /* 8 8 */
struct rb_node rb_node; /* 16 24 */
wait_queue_head_t wq; /* 40 24 */
/* --- cacheline 1 boundary (64 bytes) --- */
atomic_t refs; /* 64 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int state; /* 72 8 */
u64 private; /* 80 8 */
/* size: 88, cachelines: 2, members: 7 */
/* sum members: 84, holes: 1, sum holes: 4 */
/* last cacheline: 24 bytes */
};
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Suppress the following warning displayed on building 32bit (i686) kernel.
===============================================================================
...
CC [M] fs/btrfs/extent_io.o
fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’:
fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of
different size [-Wint-to-pointer-cast]
failrec = (struct io_failure_record *)state->private;
...
===============================================================================
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Reported-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Chris Mason <clm@fb.com>
Make the extent buffer allocation interface consistent. Cloned eb will
set a valid fs_info. For dummy eb, we can drop the length parameter and
set it from fs_info.
The built-in sanity checks may pass a NULL fs_info that's queried for
nodesize, but we know it's 4096.
Signed-off-by: David Sterba <dsterba@suse.cz>
We try to allocate an extent state structure before acquiring the extent
state tree's spinlock as we might need a new one later and therefore avoid
doing later an atomic allocation while holding the tree's spinlock. However
we returned -ENOMEM if that initial non-atomic allocation failed, which is
a bit excessive since we might end up not needing the pre-allocated extent
state at all - for the case where the tree doesn't have any extent states
that cover the input range and cover too any other range. Therefore don't
return -ENOMEM if that pre-allocation fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into two.
If that allocation failed, we would return -ENOMEM.
However, our only single caller (transaction/log commit code), passes in
an extent state that was cached from a call to find_first_extent_bit() and
that has a very high chance to match exactly the input range (always true
for a transaction commit and very often, but not always, true for a log
commit) - in this case we end up not needing at all that initial extent
state used for an eventual split. Therefore just don't return -ENOMEM if
we can't allocate the temporary extent state, since we might not need it
at all, and if we end up needing one, we'll do it later anyway.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Right now the only caller of find_first_extent_bit() that is interested
in caching extent states (transaction or log commit), never gets an extent
state cached. This is because find_first_extent_bit() only caches states
that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
the transaction/log commit caller always passes a tree that doesn't have
ever extent states with any of those flags (they can only have one of the
following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).
This change together with the following one in the patch series (titled
"Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will
help reduce significantly the chances of calls to convert_extent_bit()
fail with -ENOMEM when called from the transaction/log commit code.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
but we don't tag the pages, nor the inode's mapping, with an error. This makes it
impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
for e.g.) know that there was an error.
Note that the return value of submit_compressed_extents() is useless, as that function
is executed by a workqueue task and not directly by the fill_delalloc callback. This
means the writepage/s callbacks of the inode's address space operations don't get that
return value.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is actually inspired by Filipe's patch. When write_one_eb() fails on
submit_extent_page(), it'll give up writing this eb and mark it with
EXTENT_BUFFER_IOERR. So if it's not the last page that encounter the failure,
there are some left pages which remain DIRTY, and if a later COW on this eb
happens, ie. eb is COWed and freed, it'd run into BUG_ON in
btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page));
This adds the missing clear_page_dirty_for_io() for the rest pages of eb.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
If submit_extent_page() fails in write_one_eb(), we end up with the current
page not marked dirty anymore, unlocked and marked for writeback. But we never
end up calling end_page_writeback() against the page, which will make calls to
filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
for the writeback bit to be cleared from the page.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
After the data is written successfully, we should cleanup the read failure record
in that range because
- If we set data COW for the file, the range that the failure record pointed to is
mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writting,
the corrupted data is corrected, so the failure record can be removed. And if
some errors happen on the mirrors, we also needn't worry about it because the
failure record will be recreated if we read the same place again.
Sometimes, we may fail to correct the data, so the failure records will be left
in the tree, we need free them when we free the inode or the memory leak happens.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch implement data repair function when direct read fails.
The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
mirror.
- When the io on the mirror ends, we will insert the endio work into the
dedicated btrfs workqueue, not common read endio workqueue, because the
original endio work is still blocked in the btrfs endio workqueue, if we
insert the endio work of the io on the mirror into that workqueue, deadlock
would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the page in the direct
IO bio didn't have the filesystem information in its structure. So we need
modify it and pass all the information it need by parameters.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The original code of repair_io_failure was just used for buffered read,
because it got some filesystem data from page structure, it is safe for
the page in the page cache. But when we do a direct read, the pages in bio
are not in the page cache, that is there is no filesystem data in the page
structure. In order to implement direct read data repair, we need modify
repair_io_failure and pass all filesystem data it need by function
parameters.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The data repair function of direct read will be implemented later, and some code
in bio_readpage_error will be reused, so split bio_readpage_error into
several functions which will be used in direct read repair later.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We forgot to free failure record and bio after submitting re-read bio failed,
fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.
But it was very hard to implement the data repair when dio read failure happens,
because at the final end io function, we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check in the final end io function to the sub-bio end io function, in which we can
get the mirror number of the device we access. This patch did this work as the
first step of the direct io data repair implementation.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
We have been iterating all references for each extent we have in a file when we
do fiemap to see if it is shared. This is fine when you have a few clones or a
few snapshots, but when you have 5k snapshots suddenly fiemap just sits there
and stares at you. So add btrfs_check_shared which will use the backref walking
code but will short circuit as soon as it finds a root or inode that doesn't
match the one we currently have. This makes fiemap on my testbox go from
looking at me blankly for a day to spitting out actual output in a reasonable
amount of time. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We've defined a 'offset' out of bio_for_each_segment_all.
This is just a clean rename, no function changes.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
The tree field of struct extent_state was only used to figure out if
an extent state was connected to an inode's io tree or not. For this
we can just use the rb_node field itself.
On a x86_64 system with this change the sizeof(struct extent_state) is
reduced from 96 bytes down to 88 bytes, meaning that with a page size
of 4096 bytes we can now store 46 extent states per page instead of 42.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"The biggest of these comes from Liu Bo, who tracked down a hang we've
been hitting since moving to kernel workqueues (it's a btrfs bug, not
in the generic code). His patch needs backporting to 3.16 and 3.15
stable, which I'll send once this is in.
Otherwise these are assorted fixes. Most were integrated last week
during KS, but I wanted to give everyone the chance to test the
result, so I waited for rc2 to come out before sending"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
Btrfs: fix task hang under heavy compressed write
Btrfs: fix filemap_flush call in btrfs_file_release
Btrfs: fix crash on endio of reading corrupted block
btrfs: fix leak in qgroup_subtree_accounting() error path
btrfs: Use right extent length when inserting overlap extent map.
Btrfs: clone, don't create invalid hole extent map
Btrfs: don't monopolize a core when evicting inode
Btrfs: fix hole detection during file fsync
Btrfs: ensure tmpfile inode is always persisted with link count of 0
Btrfs: race free update of commit root for ro snapshots
Btrfs: fix regression of btrfs device replace
Btrfs: don't consider the missing device when allocating new chunks
Btrfs: Fix wrong device size when we are resizing the device
Btrfs: don't write any data into a readonly device when scrub
Btrfs: Fix the problem that the replace destroys the seed filesystem
btrfs: Return right extent when fiemap gives unaligned offset and len.
Btrfs: fix wrong extent mapping for DirectIO
Btrfs: fix wrong write range for filemap_fdatawrite_range()
Btrfs: fix wrong missing device counter decrease
Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
...
The crash is
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:2124!
[...]
Workqueue: btrfs-endio normal_work_helper [btrfs]
RIP: 0010:[<ffffffffa02d6055>] [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]
This is in fact a regression.
It is because we forgot to increase @offset properly in reading corrupted block,
so that the @offset remains, and this leads to checksum errors while reading
left blocks queued up in the same bio, and then ends up with hiting the above
BUG_ON.
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
When page aligned start and len passed to extent_fiemap(), the result is
good, but when start and len is not aligned, e.g. start = 1 and len =
4095 is passed to extent_fiemap(), it returns no extent.
The problem is that start and len is all rounded down which causes the
problem. This patch will round down start and round up (start + len) to
return right extent.
Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull more btrfs updates from Chris Mason:
"This has a few fixes since our last pull and a new ioctl for doing
btree searches from userland. It's very similar to the existing
ioctl, but lets us return larger items back down to the app"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix error handling in create_pending_snapshot
btrfs: fix use of uninit "ret" in end_extent_writepage()
btrfs: free ulist in qgroup_shared_accounting() error path
Btrfs: fix qgroups sanity test crash or hang
btrfs: prevent RCU warning when dereferencing radix tree slot
Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting
btrfs: new ioctl TREE_SEARCH_V2
btrfs: tree_search, search_ioctl: direct copy to userspace
btrfs: new function read_extent_buffer_to_user
btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW
btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer
btrfs: tree_search, search_ioctl: accept varying buffer
btrfs: tree_search: eliminate redundant nr_items check
If this condition in end_extent_writepage() is false:
if (tree->ops && tree->ops->writepage_end_io_hook)
we will then test an uninitialized "ret" at:
ret = ret < 0 ? ret : -EIO;
The test for ret is for the case where ->writepage_end_io_hook
failed, and we'd choose that ret as the error; but if
there is no ->writepage_end_io_hook, nothing sets ret.
Initializing ret to 0 should be sufficient; if
writepage_end_io_hook wasn't set, (!uptodate) means
non-zero err was passed in, so we choose -EIO in that case.
Signed-of-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
This new function reads the content of an extent directly to user memory.
Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
Pull btrfs updates from Chris Mason:
"The biggest change here is Josef's rework of the btrfs quota
accounting, which improves the in-memory tracking of delayed extent
operations.
I had been working on Btrfs stack usage for a while, mostly because it
had become impossible to do long stress runs with slab, lockdep and
pagealloc debugging turned on without blowing the stack. Even though
you upgraded us to a nice king sized stack, I kept most of the
patches.
We also have some very hard to find corruption fixes, an awesome sysfs
use after free, and the usual assortment of optimizations, cleanups
and other fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (80 commits)
Btrfs: convert smp_mb__{before,after}_clear_bit
Btrfs: fix scrub_print_warning to handle skinny metadata extents
Btrfs: make fsync work after cloning into a file
Btrfs: use right type to get real comparison
Btrfs: don't check nodes for extent items
Btrfs: don't release invalid page in btrfs_page_exists_in_range()
Btrfs: make sure we retry if page is a retriable exception
Btrfs: make sure we retry if we couldn't get the page
btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56
trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/
Btrfs: fix leaf corruption after __btrfs_drop_extents
Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item
Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled
btrfs: free delayed node outside of root->inode_lock
btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX
Btrfs: fix transaction leak during fsync call
btrfs: Avoid trucating page or punching hole in a already existed hole.
Btrfs: update commit root on snapshot creation after orphan cleanup
Btrfs: ioctl, don't re-lock extent range when not necessary
Btrfs: avoid visiting all extent items when cloning a range
...
__extent_writepage has two unrelated parts. First it does the delayed
allocation dance and second it does the mapping and IO for the page
we're actually writing.
This splits it up into those two parts so the stack from one doesn't
impact the stack from the other.
Signed-off-by: Chris Mason <clm@fb.com>
This adds noinline_for_stack to two helpers used by
btree_write_cache_pages. It shaves us down from 424 bytes on the
stack to 280.
Signed-off-by: Chris Mason <clm@fb.com>
We need to NULL the cached_state after freeing it, otherwise
we might free it again if find_delalloc_range doesn't find anything.
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
This exercises the various parts of the new qgroup accounting code. We do some
basic stuff and do some things with the shared refs to make sure all that code
works. I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
According to commit 865ffef379
(fs: fix fsync() error reporting),
it's not stable to just check error pages because pages can be
truncated or invalidated, we should also mark mapping with error
flag so that a later fsync can catch the error.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
When running low on available disk space and having several processes
doing buffered file IO, I got the following trace in dmesg:
[ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
[ 4202.720401] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.720874] kworker/u8:1 D 0000000000000001 0 5450 2 0x00000000
[ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
[ 4202.720908] ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
[ 4202.720913] ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
[ 4202.720918] 00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
[ 4202.720922] Call Trace:
[ 4202.720931] [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.720950] [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
[ 4202.720956] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.720972] [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
[ 4202.720988] [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
[ 4202.720994] [<ffffffff810680e5>] process_one_work+0x1f5/0x530
(...)
[ 4202.721027] 2 locks held by kworker/u8:1/5450:
[ 4202.721028] #0: (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721037] #1: ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
[ 4202.721258] Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.721699] btrfs D 0000000000000001 0 7891 7890 0x00000001
[ 4202.721704] ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
[ 4202.721710] ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
[ 4202.721714] ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
[ 4202.721718] Call Trace:
[ 4202.721723] [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.721727] [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
[ 4202.721732] [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
[ 4202.721736] [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 4202.721740] [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 4202.721744] [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
[ 4202.721749] [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
[ 4202.721765] [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
[ 4202.721781] [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
[ 4202.721786] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.721799] [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
[ 4202.721813] [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
(...)
It turns out that extent_io.c:__extent_writepage(), which ends up being called
through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
-ENOSPC when calling the fill_delalloc callback. In this situation, it returned
without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
ever being called for the respective page, which prevents the ordered extent's
bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
is never queued into the endio_write_workers queue. This makes the task that
called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
extent.
This is fairly easy to reproduce using a small filesystem and fsstress on
a quad core vm:
mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
mount /dev/sdd /mnt
fsstress -p 6 -d /mnt -n 100000 -x \
"btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
-f allocsp=0 \
-f bulkstat=0 \
-f bulkstat1=0 \
-f chown=0 \
-f creat=1 \
-f dread=0 \
-f dwrite=0 \
-f fallocate=1 \
-f fdatasync=0 \
-f fiemap=0 \
-f freesp=0 \
-f fsync=0 \
-f getattr=0 \
-f getdents=0 \
-f link=0 \
-f mkdir=0 \
-f mknod=0 \
-f punch=1 \
-f read=0 \
-f readlink=0 \
-f rename=0 \
-f resvsp=0 \
-f rmdir=0 \
-f setxattr=0 \
-f stat=0 \
-f symlink=0 \
-f sync=0 \
-f truncate=1 \
-f unlink=0 \
-f unresvsp=0 \
-f write=4
So just ensure that if an error happens while writing the extent page
we call the writepage_end_io_hook callback. Also make it return the
error code and ensure the caller (extent_write_cache_pages) processes
all pages in the page vector even if an error happens only for some
of them, so that ordered extents end up released.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage. The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.
The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.
The primary APIs of concern in this care are the following and are used
by most filesystems.
find_get_page
find_lock_page
find_or_create_page
grab_cache_page_nowait
grab_cache_page_write_begin
All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not. Then
old API is preserved but is basically a thin wrapper around this core
function.
Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted. This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change. It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.
The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.
The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison. As async results were variable
do to writback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.
The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.
async dd
3.15.0-rc3 3.15.0-rc3
vanilla accessed-v2
ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)
The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.
samples percentage
ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull second set of btrfs updates from Chris Mason:
"The most important changes here are from Josef, fixing a btrfs
regression in 3.14 that can cause corruptions in the extent allocation
tree when snapshots are in use.
Josef also fixed some deadlocks in send/recv and other assorted races
when balance is running"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
Btrfs: fix compile warnings on on avr32 platform
btrfs: allow mounting btrfs subvolumes with different ro/rw options
btrfs: export global block reserve size as space_info
btrfs: fix crash in remount(thread_pool=) case
Btrfs: abort the transaction when we don't find our extent ref
Btrfs: fix EINVAL checks in btrfs_clone
Btrfs: fix unlock in __start_delalloc_inodes()
Btrfs: scrub raid56 stripes in the right way
Btrfs: don't compress for a small write
Btrfs: more efficient io tree navigation on wait_extent_bit
Btrfs: send, build path string only once in send_hole
btrfs: filter invalid arg for btrfs resize
Btrfs: send, fix data corruption due to incorrect hole detection
Btrfs: kmalloc() doesn't return an ERR_PTR
Btrfs: fix snapshot vs nocow writting
btrfs: Change the expanding write sequence to fix snapshot related bug.
btrfs: make device scan less noisy
btrfs: fix lockdep warning with reclaim lock inversion
Btrfs: hold the commit_root_sem when getting the commit root during send
Btrfs: remove transaction from send
...
If we don't reschedule use rb_next to find the next extent state
instead of a full tree search, which is more efficient and safe
since we didn't release the io tree's lock.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel. This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed. Turns out this is because of a
bad interaction of balance and send/receive. Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening. So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid. But because we think it is invalid we clear uptodate and
re-read in the block. If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.
Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data. So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs changes from Chris Mason:
"This is a pretty long stream of bug fixes and performance fixes.
Qu Wenruo has replaced the btrfs async threads with regular kernel
workqueues. We'll keep an eye out for performance differences, but
it's nice to be using more generic code for this.
We still have some corruption fixes and other patches coming in for
the merge window, but this batch is tested and ready to go"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
Btrfs: fix a crash of clone with inline extents's split
btrfs: fix uninit variable warning
Btrfs: take into account total references when doing backref lookup
Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
Btrfs: fix incremental send's decision to delay a dir move/rename
Btrfs: remove unnecessary inode generation lookup in send
Btrfs: fix race when updating existing ref head
btrfs: Add trace for btrfs_workqueue alloc/destroy
Btrfs: less fs tree lock contention when using autodefrag
Btrfs: return EPERM when deleting a default subvolume
Btrfs: add missing kfree in btrfs_destroy_workqueue
Btrfs: cache extent states in defrag code path
Btrfs: fix deadlock with nested trans handles
Btrfs: fix possible empty list access when flushing the delalloc inodes
Btrfs: split the global ordered extents mutex
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
Btrfs: reclaim delalloc metadata more aggressively
Btrfs: remove unnecessary lock in may_commit_transaction()
Btrfs: remove the unnecessary flush when preparing the pages
Btrfs: just do dirty page flush for the inode with compression before direct IO
...
When we split an extent state there's no need to start the rbtree search
from the root node - we can start it from the original extent state node,
since we would end up in its subtree if we do the search starting at the
root node anyway.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We don't need to have an unsigned int field in the extent_map struct
to tell us whether the extent map is in the inode's extent_map tree or
not. We can use the rb_node struct field and the RB_CLEAR_NODE and
RB_EMPTY_NODE macros to achieve the same task.
This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
64 bits system).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Pull btrfs updates from Chris Mason:
"This is a pretty big pull, and most of these changes have been
floating in btrfs-next for a long time. Filipe's properties work is a
cool building block for inheriting attributes like compression down on
a per inode basis.
Jeff Mahoney kicked in code to export filesystem info into sysfs.
Otherwise, lots of performance improvements, cleanups and bug fixes.
Looks like there are still a few other small pending incrementals, but
I wanted to get the bulk of this in first"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits)
Btrfs: fix spin_unlock in check_ref_cleanup
Btrfs: setup inode location during btrfs_init_inode_locked
Btrfs: don't use ram_bytes for uncompressed inline items
Btrfs: fix btrfs_search_slot_for_read backwards iteration
Btrfs: do not export ulist functions
Btrfs: rework ulist with list+rb_tree
Btrfs: fix memory leaks on walking backrefs failure
Btrfs: fix send file hole detection leading to data corruption
Btrfs: add a reschedule point in btrfs_find_all_roots()
Btrfs: make send's file extent item search more efficient
Btrfs: fix to catch all errors when resolving indirect ref
Btrfs: fix protection between walking backrefs and root deletion
btrfs: fix warning while merging two adjacent extents
Btrfs: fix infinite path build loops in incremental send
btrfs: undo sysfs when open_ctree() fails
Btrfs: fix snprintf usage by send's gen_unique_name
btrfs: fix defrag 32-bit integer overflow
btrfs: sysfs: list the NO_HOLES feature
btrfs: sysfs: don't show reserved incompat feature
btrfs: call permission checks earlier in ioctls and return EPERM
...
Convert all applicable cases of printk and pr_* to the btrfs_* macros.
Fix all uses of the BTRFS prefix.
Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I need to create a fake tree to test qgroups and I don't want to have to setup a
fake btree_inode. The fact is we only use the radix tree for the fs_info, so
everybody else who allocates an extent_io_tree is just wasting the space anyway.
This patch moves the radix tree and its lock into btrfs_fs_info so there is less
stuff I have to fake to do qgroup sanity tests. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
For creating a dummy in-memory btree I need to be able to use the radix tree to
keep track of the buffers like normal extent buffers. With dummy buffers we
skip the radix tree step, and we still want to do that for the tree mod log
dummy buffers but for my test buffers we need to be able to remove them from the
radix tree like normal. This will give me a way to do that. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I need to add infrastructure to allocate dummy extent buffers for running sanity
tests, and to do this I need to not have to worry about having an
address_mapping for an io_tree, so just fix up the places where we assume that
all io_tree's have a non-NULL ->mapping. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently we do 2 traversals of an inode's extent_io_tree
before inserting an extent state structure: 1 to see if a
matching extent state already exists and 1 to do the insertion
if the fist traversal didn't found such extent state.
This change just combines those tree traversals into a single one.
While running sysbench tests (random writes) I captured the number
of elements in extent_io_tree trees for a while (into a procfs file
backed by a seq_list from seq_file module) and got this histogram:
Count: 9310
Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688
Percentiles: 90th: 20985.000; 95th: 21155.000; 99th: 21369.000
51.000 - 93.933: 693 ########
93.933 - 172.314: 938 ##########
172.314 - 315.408: 856 #########
315.408 - 576.646: 95 #
576.646 - 6415.830: 888 ##########
6415.830 - 11713.809: 1024 ###########
11713.809 - 21386.000: 4816 #####################################################
So traversing such trees can take some significant time that can
easily be avoided.
Ran the following sysbench tests, 5 times each, for sequential and
random writes, and got the following results:
sysbench --test=fileio --file-num=1 --file-total-size=2G \
--file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \
--max-requests=0 --max-time=60 --file-io-mode=sync
sysbench --test=fileio --file-num=1 --file-total-size=2G \
--file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \
--max-requests=0 --max-time=60 --file-io-mode=sync
Before this change:
sequential writes: 69.28Mb/sec (average of 5 runs)
random writes: 4.14Mb/sec (average of 5 runs)
After this change:
sequential writes: 69.91Mb/sec (average of 5 runs)
random writes: 5.69Mb/sec (average of 5 runs)
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we didn't find a matching extent state, we inserted a new one
but didn't cache it in the **cached_state parameter, which makes a
subsequent call do a tree lookup to get it.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove unused variables:
* tree from end_bio_extent_writepage,
* item from extent_fiemap.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:
- A fix for a use-after-free of a request in blk-mq. From Ming Lei
- A fix for a blk-mq bug that could attempt to dereference a NULL rq
if allocation failed
- Two xen-blkfront small fixes
- Cleanup of submit_bio_wait() type uses in the kernel, unifying
that. From Kent
- A fix for 32-bit blkg_rwstat reading. I apologize for this one
looking mangled in the shortlog, it's entirely my fault for missing
an empty line between the description and body of the text"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: fix use-after-free of request
blk-mq: fix dereference of rq->mq_ctx if allocation fails
block: xen-blkfront: Fix possible NULL ptr dereference
xen-blkfront: Silence pfn maybe-uninitialized warning
block: submit_bio_wait() conversions
Update of blkg_stat and blkg_rwstat may happen in bh context
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With immutable biovecs we don't want code accessing bi_io_vec directly -
the uses this patch changes weren't incorrect since they all own the
bio, but it makes the code harder to audit for no good reason - also,
this will help with multipage bvecs later.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
It was being open coded in a few places.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
This disables the "if needed, write the good copy back before the read
is completed" part of the read sequence for read-only mounts.
Cc: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Fix spacing issues detected via checkpatch.pl in accordance with the
kernel style guidelines.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
and inc_nlink(). This doesn't belong in mainline.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
move_pages() has an inefficient backwards byte copy of regions of two
different pages. They're different pages so the regions won't overlap
and it could use memcpy().
At that point, though, move_pages() would be a slightly dimmer
re-implementation of copy_pages() that lacked the test for overlapping
page regions.
So remove move_pages() and just call copy_pages().
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
alloc_extent_buffer() uses radix_tree_lookup() when radix_tree_insert()
fails with EEXIST. That part of the code is very similar to the code in
find_extent_buffer(). This patch replaces radix_tree_lookup() and
surrounding code in alloc_extent_buffer() with find_extent_buffer().
Note that radix_tree_lookup() does not need to be protected by
tree->buffer_lock. It is protected by eb->refs.
While at it, this patch
- changes the other usage of radix_tree_lookup() in alloc_extent_buffer()
with find_extent_buffer() to reduce redundancy.
- removes the unused argument 'len' to find_extent_buffer().
Signed-Off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
So both Liu and I made huge messes of find_lock_delalloc_range trying to fix
stuff, me first by fixing extent size, then him by fixing something I broke and
then me again telling him to fix it a different way. So this is obviously a
candidate for some testing. This patch adds a pseudo fs so we can allocate fake
inodes for tests that need an inode or pages. Then it addes a bunch of tests to
make sure find_lock_delalloc_range is acting the way it is supposed to. With
this patch and all of our previous patches to find_lock_delalloc_range I am sure
it is working as expected now. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Similar to ocfs2, btrfs also supports that extents can be shared by
different inodes, and there are some userspace tools requesting
for this kind of 'space shared infomation'.[1]
ocfs2 uses flag FIEMAP_EXTENT_SHARED, so does btrfs.
[1]: http://thr3ads.net/ocfs2-devel/2010/09/489052-PATCH-3-3-shared-du-using-fiemap-to-figure-up-the-shared-extents-per-file-and-the-footprint-in
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"We've got more bug fixes in my for-linus branch:
One of these fixes another corner of the compression oops from last
time. Miao nailed down some problems with concurrent snapshot
deletion and drive balancing.
I kept out one of his patches for more testing, but these are all
stable"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix oops caused by the space balance and dead roots
Btrfs: insert orphan roots into fs radix tree
Btrfs: limit delalloc pages outside of find_delalloc_range
Btrfs: use right root when checking for hash collision
Liu fixed part of this problem and unfortunately I steered him in slightly the
wrong direction and so didn't completely fix the problem. The problem is we
limit the size of the delalloc range we are looking for to max bytes and then we
try to lock that range. If we fail to lock the pages in that range we will
shrink the max bytes to a single page and re loop. However if our first page is
inside of the delalloc range then we will end up limiting the end of the range
to a period before our first page. This is illustrated below
[0 -------- delalloc range --------- 256mb]
[page]
So find_delalloc_range will return with delalloc_start as 0 and end as 128mb,
and then we will notice that delalloc_start < *start and adjust it up, but not
adjust delalloc_end up, so things go sideways. To fix this we need to not limit
the max bytes in find_delalloc_range, but in find_lock_delalloc_range and that
way we don't end up with this confusion. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The crash[1] is found by xfstests/generic/208 with "-o compress",
it's not reproduced everytime, but it does panic.
The bug is quite interesting, it's actually introduced by a recent commit
(573aecafca,
Btrfs: actually limit the size of delalloc range).
Btrfs implements delay allocation, so during writeback, we
(1) get a page A and lock it
(2) search the state tree for delalloc bytes and lock all pages within the range
(3) process the delalloc range, including find disk space and create
ordered extent and so on.
(4) submit the page A.
It runs well in normal cases, but if we're in a racy case, eg.
buffered compressed writes and aio-dio writes,
sometimes we may fail to lock all pages in the 'delalloc' range,
in which case, we need to fall back to search the state tree again with
a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).
The mentioned commit has a side effect, that is, in the fallback case,
we can find delalloc bytes before the index of the page we already have locked,
so we're in the case of (delalloc_end <= *start) and return with (found > 0).
This ends with not locking delalloc pages but making ->writepage still
process them, and the crash happens.
This fixes it by just thinking that we find nothing and returning to caller
as the caller knows how to deal with it properly.
[1]:
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2170!
[...]
CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G O 3.11.0+ #8
[...]
RIP: 0010:[<ffffffff810f5093>] [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[...]
[ 4934.248731] Stack:
[ 4934.248731] ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a
[ 4934.248731] 0000000000000000 0000000000000000 0000000000000fff 0000000000000620
[ 4934.248731] ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0
[ 4934.248731] Call Trace:
[ 4934.248731] [<ffffffffa02b841a>] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs]
[ 4934.248731] [<ffffffffa02a8889>] compress_file_range+0x1dc/0x4cb [btrfs]
[ 4934.248731] [<ffffffff8104f7af>] ? detach_if_pending+0x22/0x4b
[ 4934.248731] [<ffffffffa02a8bad>] async_cow_start+0x35/0x53 [btrfs]
[ 4934.248731] [<ffffffffa02c694b>] worker_loop+0x14b/0x48c [btrfs]
[ 4934.248731] [<ffffffffa02c6800>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
[ 4934.248731] [<ffffffff810608f5>] kthread+0x8d/0x95
[ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] [<ffffffff814fe09c>] ret_from_fork+0x7c/0xb0
[ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
[ 4934.248731] RIP [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[ 4934.248731] RSP <ffff8801869f9c48>
[ 4934.280307] ---[ end trace 36f06d3f8750236a ]---
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So forever we have had this thing to limit the amount of delalloc pages we'll
setup to be written out to 128mb. This is because we have to lock all the pages
in this range, so anything above this gets a bit unweildly, and also without a
limit we'll happily allocate gigantic chunks of disk space. Turns out our check
for this wasn't quite right, we wouldn't actually limit the chunk we wanted to
write out, we'd just stop looking for more space after we went over the limit.
So if you do a giant 20gb dd on my box with lots of ram I could get 2gig
extents. This is fine normally, except when you go to relocate these extents
and we can't find enough space to relocate these moster extents, since we have
to be able to allocate exactly the same sized extent to move it around. So fix
this by actually enforcing the limit. With this patch I'm no longer seeing
giant 1.5gb extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
PAGE_CACHE_SIZE == PAGE_SIZE is "unsigned long" everywhere, so there's no
need to cast it to "unsigned long".
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
u64 is "unsigned long long" on all architectures now, so there's no need to
cast it when formatting it using the "ll" length modifier.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__
I tried to filter out the warnings for which patches have already
been sent to the mailing list, pending for inclusion in btrfs-next.
All these changes should be obviously safe.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We want this for btrfs_extent_same. Basically readpage and friends do their
own extent locking but for the purposes of dedupe, we want to have both
files locked down across a set of readpage operations (so that we can
compare data). Introduce this variant and a flag which can be set for
extent_read_full_page() to indicate that we are already locked.
Partial credit for this patch goes to Gabriel de Perthuis <g2p.code@gmail.com>
as I have included a fix from him to the original patch which avoids a
deadlock on compressed extents.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There is no reason we can't just set the path to blocking and then do normal
GFP_NOFS allocations for these extent buffers. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the
tree mod log. Instead of BUG_ON()'ing in this case pass up ENOMEM. I looked
back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret)
in this path. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch removes the io_tree argument for extent_clear_unlock_delalloc since
we always use &BTRFS_I(inode)->io_tree, and it separates out the extent tree
operations from the page operations. This way we just pass in the extent bits
we want to clear and then pass in the operations we want done to the pages.
This is because I'm going to fix what extent bits we clear in some cases and
rather than add a bunch of new flags we'll just use the actual extent bits we
want to clear. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we read several pages at once, we needn't get the extent map object
every time we deal with a page, and we can cache the extent map object.
So, we can reduce the search time of the extent map, and besides that, we
also can reduce the lock contention of the extent map tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In the past, we cached the checksum value in the extent state object, so we
had to split the extent state object by the block size, or we had no space
to keep this checksum value. But it increased the lock contention of the
extent state tree.
Now we removed this limit by caching the checksum into the bio object, so
it is unnecessary to do the extent state operations by the block size, we
can do it in batches, in this way, we can reduce the lock contention of
the extent state tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Before applying this patch, we set the uptodate flag and unlock the extent
by the page size, it is unnecessary, we can do it in batches, it can reduce
the lock contention of the extent state tree.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Before applying this patch, we cached the csum value into the extent state
tree when reading some data from the disk, this operation increased the lock
contention of the state tree.
Now, we just store the csum value into the bio structure or other unshared
structure, so we can reduce the lock contention.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch add some branch prediction hints into the end IO function
of the read page, it reduced the percentage of the branch misses from
5.5% to 4.9%.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
offsetting the physical based on the extent offset. This is perfectly fine with
uncompressed extents, however the extent offset is into the uncompressed area,
not the compressed. So we can return a physical value that isn't at all within
the area we have allocated on disk. Fix this by returning the start of the
extent if it is compressed no matter what the offset. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs update from Chris Mason:
"These are the usual mixture of bugs, cleanups and performance fixes.
Miao has some really nice tuning of our crc code as well as our
transaction commits.
Josef is peeling off more and more problems related to early enospc,
and has a number of important bug fixes in here too"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (81 commits)
Btrfs: wait ordered range before doing direct io
Btrfs: only do the tree_mod_log_free_eb if this is our last ref
Btrfs: hold the tree mod lock in __tree_mod_log_rewind
Btrfs: make backref walking code handle skinny metadata
Btrfs: fix crash regarding to ulist_add_merge
Btrfs: fix several potential problems in copy_nocow_pages_for_inode
Btrfs: cleanup the code of copy_nocow_pages_for_inode()
Btrfs: fix oops when recovering the file data by scrub function
Btrfs: make the chunk allocator completely tree lockless
Btrfs: cleanup orphaned root orphan item
Btrfs: fix wrong mirror number tuning
Btrfs: cleanup redundant code in btrfs_submit_direct()
Btrfs: remove btrfs_sector_sum structure
Btrfs: check if we can nocow if we don't have data space
Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
Btrfs: use a percpu to keep track of possibly pinned bytes
Btrfs: check for actual acls rather than just xattrs when caching no acl
Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
Btrfs: optimize reada_for_balance
Btrfs: optimize read_block_for_search
...
We always just try and reserve data space when we write, but if we are out of
space but have prealloc'ed extents we should still successfully write. This
patch will try and see if we can write to prealloc'ed space and if we can go
ahead and allow the write to continue. With this patch we now pass xfstests
generic/274. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This has plagued us forever and I'm so over working around it. When we truncate
down to a non-page aligned offset we will call btrfs_truncate_page to zero out
the end of the page and write it back to disk, this will keep us from exposing
stale data if we truncate back up from that point. The problem with this is it
requires data space to do this, and people don't really expect to get ENOSPC
from truncate() for these sort of things. This also tends to bite the orphan
cleanup stuff too which keeps people from mounting. To get around this we can
just move this into btrfs_cont_expand() to make sure if we are truncating up
from a non-page size aligned i_size we will zero out the rest of this page so
that we don't expose stale data. This will give ENOSPC if you try to truncate()
up or if you try to write past the end of isize, which is much more reasonable.
This fixes xfstests generic/083 failing to mount because of the orphan cleanup
failing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The 'end' value must exactly cover the end of the interval, which means
one byte less than the expected block alignment, or in case of a file
smaller than one block, one byte less than the inode size.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.
Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).
This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.
We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.
Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Pull btrfs fixes from Chris Mason:
"Miao Xie has been very busy, fixing races and enospc problems and many
other small but important pieces.
Alexandre Oliva discovered some problems with how our error handling
was interacting with the block layer and for now has disabled our
partial handling of sub-page writes. The real sub-page work is in a
series of patches from IBM that we still need to integrate and test.
The code Alexandre has turned off was really incomplete.
Josef has more error handling fixes and an important fix for the new
skinny extent format.
This also has my fix for the tracepoint crash from late in 3.9. It's
the first stage in a larger clean up to get rid of btrfs_bio and make
a proper bioset for all the items we need to tack into the bio. For
now the bioset only holds our mirror_num and stripe_index, but for the
next merge window I'll shuffle more in."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: use a btrfs bioset instead of abusing bio internals
Btrfs: make sure roots are assigned before freeing their nodes
Btrfs: explicitly use global_block_rsv for quota_tree
btrfs: do away with non-whole_page extent I/O
Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
Btrfs: pause the space balance when remounting to R/O
Btrfs: fix unprotected root node of the subvolume's inode rb-tree
Btrfs: fix accessing a freed tree root
Btrfs: return errno if possible when we fail to allocate memory
Btrfs: update the global reserve if it is empty
Btrfs: don't steal the reserved space from the global reserve if their space type is different
Btrfs: optimize the error handle of use_block_rsv()
Btrfs: don't use global block reservation for inode cache truncation
Btrfs: don't abort the current transaction if there is no enough space for inode cache
Correct allowed raid levels on balance.
Btrfs: fix possible memory leak in replace_path()
Btrfs: fix possible memory leak in the find_parent_nodes()
Btrfs: don't allow device replace on RAID5/RAID6
Btrfs: handle running extent ops with skinny metadata
...
Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.
As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs. They are also used
to count IO failures on a per device basis.
Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.
This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index. The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error. This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.
It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case. Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!
I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes. The
warnings should probably just go away some time.
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
lock_extent/unlock_extent expect an exclusive end.
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Pull btrfs update from Chris Mason:
"These are mostly fixes. The biggest exceptions are Josef's skinny
extents and Jan Schmidt's code to rebuild our quota indexes if they
get out of sync (or you enable quotas on an existing filesystem).
The skinny extents are off by default because they are a new variation
on the extent allocation tree format. btrfstune -x enables them, and
the new format makes the extent allocation tree about 30% smaller.
I rebased this a few days ago to rework Dave Sterba's crc checks on
the super block, but almost all of these go back to rc6, since I
though 3.9 was due any minute.
The biggest missing fix is the tracepoint bug that was hit late in
3.9. I ran into problems with that in overnight testing and I'm still
tracking it down. I'll definitely have that fixed for rc2."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits)
Btrfs: allow superblock mismatch from older mkfs
btrfs: enhance superblock checks
btrfs: fix misleading variable name for flags
btrfs: use unsigned long type for extent state bits
Btrfs: improve the loop of scrub_stripe
btrfs: read entire device info under lock
btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
btrfs: handle errors returned from get_tree_block_key
btrfs: make static code static & remove dead code
Btrfs: deal with errors in write_dev_supers
Btrfs: remove almost all of the BUG()'s from tree-log.c
Btrfs: deal with free space cache errors while replaying log
Btrfs: automatic rescan after "quota enable" command
Btrfs: rescan for qgroups
Btrfs: split btrfs_qgroup_account_ref into four functions
Btrfs: allocate new chunks if the space is not enough for global rsv
Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
btrfs: move leak debug code to functions
Btrfs: return free space in cow error path
Btrfs: set UUID in root_item for created trees
...
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.
removed functions:
btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()
btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.
ulist.c functions are left, another patch will take care of those.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Clean up the leak debugging in extent_io.c by moving
the debug code into functions. This also removes the
list_heads used for debugging from the extent_buffer
and extent_state structures when debug is not enabled.
Since we need a global debug config to do that last
part, implement CONFIG_BTRFS_DEBUG to accommodate.
Thanks to Dave Sterba for the Kconfig bit.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can just look up the extent_buffers for the range and free stuff that way.
This makes the cleanup a bit cleaner and we can make sure to evict the
extent_buffers pretty quickly by marking them as stale. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need to tag metadata io with REQ_META to avoid priority inversion when using
io throttling cqroups. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It is very likely that there are several blocks in bio, it is very
inefficient if we get their csums one by one. This patch improves
this problem by getting the csums in batch.
According to the result of the following test, the execute time of
__btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us).
# dd if=<mnt>/file of=/dev/null bs=1M count=1024
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
set_extent_bit()'s (u64 *failed_start) expects NULL not 0.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Tejun writes:
-----
This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.
* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.
* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.
The three commits are located in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.
e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")
The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
so that you can use it for verification. The test merge commit has
proper merge description.
While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.
----
Fixed up the conflict.
Conflicts:
drivers/md/raid5.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Btrfs uses page_mkwrite to ensure stable pages during
crc calculations and mmap workloads. We call clear_page_dirty_for_io
before we do any crcs, and this forces any application with the file
mapped to wait for the crc to finish before it is allowed to change
the file.
With compression on, the clear_page_dirty_for_io step is happening after
we've compressed the pages. This means the applications might be
changing the pages while we are compressing them, and some of those
modifications might not hit the disk.
This commit adds the clear_page_dirty_for_io before compression starts
and makes sure to redirty the page if we have to fallback to
uncompressed IO as well.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Alexandre Oliva <oliva@gnu.org>
cc: stable@vger.kernel.org
Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
We want to avoid module.h where posible, since it in turn includes
nearly all of header space. This means removing it where it is not
required, and using export.h where we are only exporting symbols via
EXPORT_SYMBOL and friends.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The nodesize is capped at 64k and there are enough pages preallocated in
extent_buffer::inline_pages. The fallback to kmalloc never happened
because even on the smallest page size considered (4k) inline_pages
covered the needs.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Though most of the btrfs codes are using ALIGN macro for page alignment,
there are still some codes using open-coded alignment like the
following:
------
u64 mask = ((u64)root->stripesize - 1);
u64 ret = (val + mask) & ~mask;
------
Or even hidden one:
------
num_bytes = (end - start + blocksize) & ~(blocksize - 1);
------
Sometimes these open-coded alignment is not so easy to understand for
newbie like me.
This commit changes the open-coded alignment to the ALIGN macro for a
better readability.
Also there is a previous patch from David Sterba with similar changes,
but the patch is for 3.2 kernel and seems not merged.
http://www.spinics.net/lists/linux-btrfs/msg12747.html
Cc: David Sterba <dave@jikos.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
->dirty_metadata_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant to reduce the contention of
the lock.
This patch also fixed the problem that we access it without
lock protection in __btrfs_btree_balance_dirty(), which may
cause we skip the dirty pages flush.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Use wrapper page_offset to get byte-offset into filesystem object for page.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree. On tree roots and
other extent buffers that very cache hot, this can be highly contended.
These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.
Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.
But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.
Scrub is unable to repair crc errors on raid5/6 chunks.
Discard does not work on raid5/6 (yet)
The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the addition of the device replace procedure, it is possible
for btrfs_map_bio(READ) to report an error. This happens when the
specific mirror is requested which is located on the target disk,
and the copy operation has not yet copied this block. Hence the
block cannot be read and this error state is indicated by
returning EIO.
Some background information follows now. A new mirror is added
while the device replace procedure is running.
btrfs_get_num_copies() returns one more, and
btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk
location is involved that was already handled by the device
replace copy operation. The assigned mirror num is the highest
mirror number, e.g. the value 3 in case of RAID1.
If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select
any mirror), the copy on the target drive is never selected
because that disk shall be able to perform the write requests as
quickly as possible. The parallel execution of read requests would
only slow down the disk copy procedure. Second case is that
btrfs_map_bio() is called with mirror_num > 0. This is done from
the repair code only. In this case, the highest mirror num is
assigned to the target disk, since it is used last. And when this
mirror is not available because the copy procedure has not yet
handled this area, an error is returned. Everywhere in the code
the handling of such errors is added now.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is required for the device replace procedure in a later step.
Two calling functions also had to be changed to have the fs_info
pointer: repair_io_failure() and scrub_setup_recheck_block().
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is required for the device replace procedure in a later step.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use WARN rather than printk followed by WARN_ON(1), for conciseness.
A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression list es;
@@
-printk(
+WARN(1,
es);
-WARN_ON(1);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"This has our series of fixes for the next rc. The biggest batch is
from Jan Schmidt, fixing up some problems in our subvolume quota code
and fixing btrfs send/receive to work with the new extended inode
refs."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: do not bug when we fail to commit the transaction
Btrfs: fix memory leak when cloning root's node
Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
Btrfs: fix deadlock caused by the nested chunk allocation
btrfs: Return EINVAL when length to trim is less than FSB
Btrfs: fix memory leak in btrfs_quota_enable()
Btrfs: send correct rdev and mode in btrfs-send
Btrfs: extended inode refs support for send mechanism
Btrfs: Fix wrong error handling code
Fix a sign bug causing invalid memory access in the ino_paths ioctl.
Btrfs: comment for loop in tree_mod_log_insert_move
Btrfs: fix extent buffer reference for tree mod log roots
Btrfs: determine level of old roots
Btrfs: tree mod log's old roots could still be part of the tree
Btrfs: fix a tree mod logging issue for root replacement operations
Btrfs: don't put removals from push_node_left into tree mod log twice
gcc says "warning: comparison of unsigned expression >= 0 is always
true" because i is an unsigned long. And gcc is right this time.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Pull btrfs update from Chris Mason:
"This is a large pull, with the bulk of the updates coming from:
- Hole punching
- send/receive fixes
- fsync performance
- Disk format extension allowing more hardlinks inside a single
directory (btrfs-progs patch required to enable the compat bit for
this one)
I'm cooking more unrelated RAID code, but I wanted to make sure this
original batch makes it in. The largest updates here are relatively
old and have been in testing for some time."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (121 commits)
btrfs: init ref_index to zero in add_inode_ref
Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer
Btrfs: fix page leakage
Btrfs: do not warn_on when we cannot alloc a page for an extent buffer
Btrfs: don't bug on enomem in readpage
Btrfs: cleanup pages properly when ENOMEM in compression
Btrfs: make filesystem read-only when submitting barrier fails
Btrfs: detect corrupted filesystem after write I/O errors
Btrfs: make compress and nodatacow mount options mutually exclusive
btrfs: fix message printing
Btrfs: don't bother committing delayed inode updates when fsyncing
btrfs: move inline function code to header file
Btrfs: remove unnecessary IS_ERR in bio_readpage_error()
btrfs: remove unused function btrfs_insert_some_items()
Btrfs: don't commit instead of overcommitting
Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called
Btrfs: be smarter about dropping things from the tree log
Btrfs: don't lookup csums for prealloc extents
Btrfs: cache extent state when writing out dirty metadata pages
Btrfs: do not hold the file extent leaf locked when adding extent item
...
Alloc_dummy_extent_buffer will not free the first page in the eb array if we
fail to allocate a page, fix this. Thanks,
Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It's just annoying and the user will have gotten a nice OOM killer message
so they are already fully aware they are screwed :). Thanks,
Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Get rid of the BUG_ON(ret == -ENOMEM) in __extent_read_full_page. Thanks,
Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When building btrfs from kernel code, it will report:
fs/btrfs/extent_io.h:281: warning: 'extent_buffer_page' declared inline after being called
fs/btrfs/extent_io.h:281: warning: previous declaration of 'extent_buffer_page' was here
fs/btrfs/extent_io.h:280: warning: 'num_extent_pages' declared inline after being called
fs/btrfs/extent_io.h:280: warning: previous declaration of 'num_extent_pages' was here
because of the wrong declaration of inline functions.
Signed-off-by: Robin Dong <sanbai@taobao.com>
Everytime we write out dirty pages we search for an offset in the tree,
convert the bits in the state, and then when we wait we search for the
offset again and clear the bits. So for every dirty range in the io tree we
are doing 4 rb searches, which is suboptimal. With this patch we are only
doing 2 searches for every cycle (modulo weird things happening). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are a coule scenarios where farming metadata csumming off to an async
thread doesn't help. The first is if our processor supports crc32c, in
which case the csumming will be fast and so the overhead of the async model
is not worth the cost. The other case is for our tree log. We will be
making that stuff dirty and writing it out and waiting for it immediately.
Even with software crc32c this gives me a ~15% increase in speed with O_SYNC
workloads. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can race when checking wether PagePrivate is set on a page and we
actually have an eb saved in the pages private pointer. We could have
easily written out this page and released it in the time that we did the
pagevec lookup and actually got around to looking at this page. So use
mapping->private_lock to ensure we get a consistent view of the
page->private pointer. This is inline with the alloc and releasepage paths
which use private_lock when manipulating page->private. Thanks,
Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.
Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For immutable bio vecs, I've been auditing and removing bi_idx
references. These were harmless, but removing them will make auditing
easier.
scrub_bio_end_io_worker() was open coding a bio_reset() - but this
doesn't appear to have been needed for anything as right after it does a
bio_put(), and perusing the code it doesn't appear anything else was
holding a reference to the bio.
The other use end_bio_extent_readpage() was just for a pr_debug() -
changed it to something that might be a bit more useful.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Stefan Behrens <sbehrens@giantdisaster.de>
We're going to use this flag EXTENT_DEFRAG to indicate which range
belongs to defragment so that we can implement snapshow-aware defrag:
We set the EXTENT_DEFRAG flag when dirtying the extents that need
defragmented, so later on writeback thread can differentiate between
normal writeback and writeback started by defragmentation.
Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
The btrfs send code was assuming the offset of the file item into the
extent translated to bytes on disk. If we're compressed, this isn't
true, and so it was off into extents owned by other files.
It was also improperly handling inline extents. This solves a crash
where we may have gone past the end of the file extent item by not
testing early enough for an inline extent. It also solves problems
where we have a whole between the end of the inline item and the start
of the full extent.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"I've split out the big send/receive update from my last pull request
and now have just the fixes in my for-linus branch. The send/recv
branch will wander over to linux-next shortly though.
The largest patches in this pull are Josef's patches to fix DIO
locking problems and his patch to fix a crash during balance. They
are both well tested.
The rest are smaller fixes that we've had queued. The last rc came
out while I was hacking new and exciting ways to recover from a
misplaced rm -rf on my dev box, so these missed rc3."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: fix that repair code is spuriously executed for transid failures
Btrfs: fix ordered extent leak when failing to start a transaction
Btrfs: fix a dio write regression
Btrfs: fix deadlock with freeze and sync V2
Btrfs: revert checksum error statistic which can cause a BUG()
Btrfs: remove superblock writing after fatal error
Btrfs: allow delayed refs to be merged
Btrfs: fix enospc problems when deleting a subvol
Btrfs: fix wrong mtime and ctime when creating snapshots
Btrfs: fix race in run_clustered_refs
Btrfs: don't run __tree_mod_log_free_eb on leaves
Btrfs: increase the size of the free space cache
Btrfs: barrier before waitqueue_active
Btrfs: fix deadlock in wait_for_more_refs
btrfs: fix second lock in btrfs_delete_delayed_items()
Btrfs: don't allocate a seperate csums array for direct reads
Btrfs: do not strdup non existent strings
Btrfs: do not use missing devices when showing devname
Btrfs: fix that error value is changed by mistake
Btrfs: lock extents as we map them in DIO
...
Commit 442a4f6308 added btrfs device
statistic counters for detected IO and checksum errors to Linux 3.5.
The statistic part that counts checksum errors in
end_bio_extent_readpage() can cause a BUG() in a subfunction:
"kernel BUG at fs/btrfs/volumes.c:3762!"
That part is reverted with the current patch.
However, the counting of checksum errors in the scrub context remains
active, and the counting of detected IO errors (read, write or flush
errors) in all contexts remains active.
Cc: stable <stable@vger.kernel.org> # 3.5
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pull large btrfs update from Chris Mason:
"This pull request is very large, and the two main features in here
have been under testing/devel for quite a while.
We have subvolume quotas from the strato developers. This enables
full tracking of how many blocks are allocated to each subvolume (and
all snapshots) and you can set limits on a per-subvolume basis. You
can also create quota groups and toss multiple subvolumes into a big
group. It's everything you need to be a web hosting company and give
each user their own subvolume.
The userland side of the quotas is being refreshed, they'll send out
details on where to grab it soon.
Next is the kernel side of btrfs send/receive from Alexander Block.
This leverages the same infrastructure as the quota code to figure out
relationships between blocks and their owners. It can then compute
the difference between two snapshots and sends the diffs in a neutral
format into userland.
The basic model:
create a snapshot
send that snapshot as the initial backup
make changes
create a second snapshot
send the incremental as a backup
delete the first snapshot
(use the second snapshot for the next incremental)
The receive portion is all in userland, and in the 'next' branch of my
btrfs-progs repo.
There's still some work to do in terms of optimizing the send side
from kernel to userland. The really important part is figuring out
how two snapshots are different, and this is where we are
concentrating right now. The initial send of a dataset is a little
slower than tar, but the incremental sends are dramatically faster
than what rsync can do.
On top of all of that, we have a nice queue of fixes, cleanups and
optimizations."
Fix up trivial modify/del conflict in fs/btrfs/ioctl.c
Also fix up semantic conflict in fs/btrfs/send.c: the interface to
dentry_open() changed in commit 765927b2d5 ("switch dentry_open() to
struct path, make it grab references itself"), and since it now grabs
whatever references it needs, we should no longer do the mntget() on the
mnt (and we need to dput() the dentry reference we took).
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (65 commits)
Btrfs: uninit variable fixes in send/receive
Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive
Btrfs: add btrfs_compare_trees function
Btrfs: introduce subvol uuids and times
Btrfs: make iref_to_path non static
Btrfs: add a barrier before a waitqueue_active check
Btrfs: call the ordered free operation without any locks held
Btrfs: Check INCOMPAT flags on remount and add helper function
Btrfs: add helper for tree enumeration
btrfs: allow cross-subvolume file clone
Btrfs: improve multi-thread buffer read
Btrfs: make btrfs's allocation smoothly with preallocation
Btrfs: lock the transition from dirty to writeback for an eb
Btrfs: fix potential race in extent buffer freeing
Btrfs: don't return true in releasepage unless we actually freed the eb
Btrfs: suppress printk() if all device I/O stats are zero
Btrfs: remove unwanted printk() for btrfs device I/O stats
Btrfs: rewrite BTRFS_SETGET_FUNCS
Btrfs: zero unused bytes in inode item
Btrfs: kill free_space pointer from inode structure
...
Conflicts:
fs/btrfs/ioctl.c
Pull trivial tree from Jiri Kosina:
"Trivial updates all over the place as usual."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
Fix typo in include/linux/clk.h .
pci: hotplug: Fix typo in pci
iommu: Fix typo in iommu
video: Fix typo in drivers/video
Documentation: Add newline at end-of-file to files lacking one
arm,unicore32: Remove obsolete "select MISC_DEVICES"
module.c: spelling s/postition/position/g
cpufreq: Fix typo in cpufreq driver
trivial: typo in comment in mksysmap
mach-omap2: Fix typo in debug message and comment
scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
Change email address for Steve Glendinning
Btrfs: fix typo in convert_extent_bit
via: Remove bogus if check
netprio_cgroup.c: fix comment typo
backlight: fix memory leak on obscure error path
Documentation: asus-laptop.txt references an obsolete Kconfig item
Documentation: ManagementStyle: fixed typo
mm/vmscan: cleanup comment error in balance_pgdat
mm: cleanup on the comments of zone_reclaim_stat
...
While testing with my buffer read fio jobs[1], I find that btrfs does not
perform well enough.
Here is a scenario in fio jobs:
We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
and all of them will race on add_to_page_cache_lru(), and if one thread
successfully puts its page into the page cache, it takes the responsibility
to read the page's data.
And what's more, reading a page needs a period of time to finish, in which
other threads can slide in and process rest pages:
t1 t2 t3 t4
add Page1
read Page1 add Page2
| read Page2 add Page3
| | read Page3 add Page4
| | | read Page4
-----|------------|-----------|-----------|--------
v v v v
bio bio bio bio
Now we have four bios, each of which holds only one page since we need to
maintain consecutive pages in bio. Thus, we can end up with far more bios
than we need.
Here we're going to
a) delay the real read-page section and
b) try to put more pages into page cache.
With that said, we can make each bio hold more pages and reduce the number
of bios we need.
Here is some numbers taken from fio results:
w/o patch w patch
------------- -------- ---------------
READ: 745MB/s +25% 934MB/s
[1]:
[global]
group_reporting
thread
numjobs=4
bs=32k
rw=read
ioengine=sync
directory=/mnt/btrfs/
[READ]
filename=foobar
size=2000M
invalidate=1
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is a small window where an eb can have no IO bits set on it, which
could potentially result in extent_buffer_under_io() returning false when we
want it to return true, which could result in not fun things happening. So
in order to protect this case we need to hold the refs_lock when we make
this transition to make sure we get reliable results out of
extent_buffer_udner_io(). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This sounds sort of impossible but it is the only thing I can think of and
at the very least it is theoretically possible so here it goes.
If we are in try_release_extent_buffer we will check that the ref count on
the extent buffer is 1 and not under IO, and then go down and clear the tree
ref. If between this check and clearing the tree ref somebody else comes in
and grabs a ref on the eb and the marks it dirty before
try_release_extent_buffer() does it's tree ref clear we can end up with a
dirty eb that will be freed while it is still dirty which will result in a
panic. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed while looking at an extent_buffer race that we will
unconditionally return 1 if we get down to release_extent_buffer after
clearing the tree ref. However we can easily race in here and get a ref on
the eb and not actually free the eb. So make release_extent_buffer return 1
if it free'd the eb and 0 if not so we can be a little kinder to the vm.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can race with unlink and not actually be able to do our igrab in
btrfs_add_ordered_extent. This will result in all sorts of problems.
Instead of doing the complicated work to try and handle returning an error
properly from btrfs_add_ordered_extent, just hold a ref to the inode during
writepages. If we cannot grab a ref we know we're freeing this inode anyway
and can just drop the dirty pages on the floor, because screw them we're
going to invalidate them anyway. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Al pointed out that we can just toss out the old name on a device and add a
new one arbitrarily, so anybody who uses device->name in printk could
possibly use free'd memory. Instead of adding locking around all of this he
suggested doing it with RCU, so I've introduced a struct rcu_string that
does just that and have gone through and protected all accesses to
device->name that aren't under the uuid_mutex with rcu_read_lock(). This
protects us and I will use it for dealing with removing the device that we
used to mount the file system in a later patch. Thanks,
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
The goal is to detect when drives start to get an increased error rate,
when drives should be replaced soon. Therefore statistic counters are
added that count IO errors (read, write and flush). Additionally, the
software detected errors like checksum errors and corrupted blocks are
counted.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Fully utilize our extent state's new helper functions to use
fastpath as much as possible.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
We noticed that the ordered extent completion doesn't really rely on having
a page and that it could be done independantly of ending the writeback on a
page. This patch makes us not do the threaded endio stuff for normal
buffered writes and direct writes so we can end page writeback as soon as
possible (in irq context) and only start threads to do the ordered work when
it is actually done. Compression needs to be reworked some to take
advantage of this as well, but atm it has to do a find_get_page in its endio
handler so it must be done in its own thread. This makes direct writes
quite a bit faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
These warnings are bogus since we will always have at least one page in an
eb, but to make the compiler happy just set ret = 0 in these two cases.
Thanks,
Btrfs: fix compile warnings in extent_io.c
These warnings are bogus since we will always have at least one page in an
eb, but to make the compiler happy just set ret = 0 in these two cases.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The tree modification log needs two ways to create dummy extent buffers,
once by allocating a fresh one (to rebuild an old root) and once by
cloning an existing one (to make private rewind modifications) to it.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In tree_insert, var *entry is used in the loop only, and is useless
out of the loop. Remove the useless assignment after the loop.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
The return value of find_first_extent_bit is 1 or 0, no < 0.
And if found something, return 0; if nothing was found, return 1.
Fix the comment.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
num_extent_pages returns the number of pages in the specific range, not
the index of the last page in the eb range.
btrfs_release_extent_buffer_page is called with start_idx set 0 in current
codes, so it's not a problem yet. But the logic is indeed wrong.
Fix it here.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Pull btrfs fixes from Chris Mason:
"The big ones here are a memory leak we introduced in rc1, and a
scheduling while atomic if the transid on disk doesn't match the
transid we expected. This happens for corrupt blocks, or out of date
disks.
It also fixes up the ioctl definition for our ioctl to resolve logical
inode numbers. The __u32 was a merging error and doesn't match what
we ship in the progs."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: avoid sleeping in verify_parent_transid while atomic
Btrfs: fix crash in scrub repair code when device is missing
btrfs: Fix mismatching struct members in ioctl.h
Btrfs: fix page leak when allocing extent buffers
Btrfs: Add properly locking around add_root_to_dirty_list
If we happen to alloc a extent buffer and then alloc a page and notice that
page is already attached to an extent buffer, we will only unlock it and
free our existing eb. Any pages currently attached to that eb will be
properly freed, but we don't do the page_cache_release() on the page where
we noticed the other extent buffer which can cause us to leak pages and I
hope cause the weird issues we've been seeing in this area. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pull btrfs fixes from Chris Mason:
"This has our collection of bug fixes. I missed the last rc because I
thought our patches were making NFS crash during my xfs test runs.
Turns out it was an NFS client bug fixed by someone else while I tried
to bisect it.
All of these fixes are small, but some are fairly high impact. The
biggest are fixes for our mount -o remount handling, a deadlock due to
GFP_KERNEL allocations in readdir, and a RAID10 error handling bug.
This was tested against both 3.3 and Linus' master as of this morning."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (26 commits)
Btrfs: reduce lock contention during extent insertion
Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir
Btrfs: Fix space checking during fs resize
Btrfs: fix block_rsv and space_info lock ordering
Btrfs: Prevent root_list corruption
Btrfs: fix repair code for RAID10
Btrfs: do not start delalloc inodes during sync
Btrfs: fix that check_int_data mount option was ignored
Btrfs: don't count CRC or header errors twice while scrubbing
Btrfs: fix btrfs_ioctl_dev_info() crash on missing device
btrfs: don't return EINTR
Btrfs: double unlock bug in error handling
Btrfs: always store the mirror we read the eb from
fs/btrfs/volumes.c: add missing free_fs_devices
btrfs: fix early abort in 'remount'
Btrfs: fix max chunk size check in chunk allocator
Btrfs: add missing read locks in backref.c
Btrfs: don't call free_extent_buffer twice in iterate_irefs
Btrfs: Make free_ipath() deal gracefully with NULL pointers
Btrfs: avoid possible use-after-free in clear_extent_bit()
...
A user reported a panic where we were trying to fix a bad mirror but the
mirror number we were giving was 0, which is invalid. This is because we
don't do the transid verification until after the read, so as far as the
read code is concerned the read was a success. So instead store the mirror
we read from so that if there is some failure post read we know which mirror
to try next and which mirror needs to be fixed if we find a good copy of the
block. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
clear_extent_bit()
{
next_node = rb_next(&state->rb_node);
...
clear_state_bit(state); <-- this may free next_node
if (next_node) {
state = rb_entry(next_node);
...
}
}
clear_state_bit() calls merge_state() which may free the next node
of the passing extent_state, so clear_extent_bit() may end up
referencing freed memory.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Currently it returns a set of bits that were cleared, but this return
value is not used at all.
Moreover it doesn't seem to be useful, because we may clear the bits
of a few extent_states, but only the cleared bits of last one is
returned.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Pull the minimal btrfs branch from Chris Mason:
"We have a use-after-free in there, along with errors when mount -o
discard is enabled, and a BUG_ON(we should compile with UP more
often)."
* 'for-linus-min' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use commit root when loading free space cache
Btrfs: fix use-after-free in __btrfs_end_transaction
Btrfs: check return value of bio_alloc() properly
Btrfs: remove lock assert from get_restripe_target()
Btrfs: fix eof while discarding extents
Btrfs: fix uninit variable in repair_eb_io_failure
Revert "Btrfs: increase the global block reserve estimates"
bio_alloc() has the possibility of returning NULL.
So, it is necessary to check the return value.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We'd have to be passing bogus extent buffers for this uninit variable to
actually be used, but set it to zero just in case.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pull btrfs fixes and features from Chris Mason:
"We've merged in the error handling patches from SuSE. These are
already shipping in the sles kernel, and they give btrfs the ability
to abort transactions and go readonly on errors. It involves a lot of
churn as they clarify BUG_ONs, and remove the ones we now properly
deal with.
Josef reworked the way our metadata interacts with the page cache.
page->private now points to the btrfs extent_buffer object, which
makes everything faster. He changed it so we write an whole extent
buffer at a time instead of allowing individual pages to go down,,
which will be important for the raid5/6 code (for the 3.5 merge
window ;)
Josef also made us more aggressive about dropping pages for metadata
blocks that were freed due to COW. Overall, our metadata caching is
much faster now.
We've integrated my patch for metadata bigger than the page size.
This allows metadata blocks up to 64KB in size. In practice 16K and
32K seem to work best. For workloads with lots of metadata, this cuts
down the size of the extent allocation tree dramatically and fragments
much less.
Scrub was updated to support the larger block sizes, which ended up
being a fairly large change (thanks Stefan Behrens).
We also have an assortment of fixes and updates, especially to the
balancing code (Ilya Dryomov), the back ref walker (Jan Schmidt) and
the defragging code (Liu Bo)."
Fixed up trivial conflicts in fs/btrfs/scrub.c that were just due to
removal of the second argument to k[un]map_atomic() in commit
7ac687d9e0.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (75 commits)
Btrfs: update the checks for mixed block groups with big metadata blocks
Btrfs: update to the right index of defragment
Btrfs: do not bother to defrag an extent if it is a big real extent
Btrfs: add a check to decide if we should defrag the range
Btrfs: fix recursive defragment with autodefrag option
Btrfs: fix the mismatch of page->mapping
Btrfs: fix race between direct io and autodefrag
Btrfs: fix deadlock during allocating chunks
Btrfs: show useful info in space reservation tracepoint
Btrfs: don't use crc items bigger than 4KB
Btrfs: flush out and clean up any block device pages during mount
btrfs: disallow unequal data/metadata blocksize for mixed block groups
Btrfs: enhance superblock sanity checks
Btrfs: change scrub to support big blocks
Btrfs: minor cleanup in scrub
Btrfs: introduce common define for max number of mirrors
Btrfs: fix infinite loop in btrfs_shrink_device()
Btrfs: fix memory leak in resolver code
Btrfs: allow dup for data chunks in mixed mode
Btrfs: validate target profiles only if we are going to use them
...
Since we need to read and write extent buffers in their entirety we can't use
the normal bio_readpage_error stuff since it only works on a per page basis. So
instead make it so that if we see an io error in endio we just mark the eb as
having an IO error and then in btree_read_extent_buffer_pages we will manually
try other mirrors and then overwrite the bad mirror if we find a good copy.
This works with larger than page size blocks. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch simplifies how we track our extent buffers. Previously we could exit
writepages with only having written half of an extent buffer, which meant we had
to track the state of the pages and the state of the extent buffers differently.
Now we only read in entire extent buffers and write out entire extent buffers,
this allows us to simply set bits in our bflags to indicate the state of the eb
and we no longer have to do things like track uptodate with our iotree. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because an eb can have multiple pages we need to make sure that all pages within
the eb are markes as accessed, since releasepage can be called against any page
in the eb. This will keep us from possibly evicting hot eb's when we're doing
larger than pagesize eb's. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Because btrfs cow's we can end up with extent buffers that are no longer
necessary just sitting around in memory. So instead of evicting these pages, we
could end up evicting things we actually care about. Thus we have
free_extent_buffer_stale for use when we are freeing tree blocks. This will
make it so that the ref for the eb being in the radix tree is dropped as soon as
possible and then is freed when the refcount hits 0 instead of waiting to be
released by releasepage. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We can run into a problem where we find an eb for our existing page already on
the radix tree but it has a ref count of 0. It hasn't yet been removed by RCU
yet so this can cause issues where we will use the EB after free. So do
atomic_inc_not_zero on the exists->refs and if it is zero just do
synchronize_rcu() and try again. We won't have to worry about new allocators
coming in since they will block on the page lock at this point. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We spend a lot of time looking up extent buffers from pages when we could just
store the pointer to the eb the page is associated with in page->private. This
patch does just that, and it makes things a little simpler and reduces a bit of
CPU overhead involved with doing metadata IO. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
A few years ago the btrfs code to support blocks lager than
the page size was disabled to fix a few corner cases in the
page cache handling. This fixes the code to properly support
large metadata blocks again.
Since current kernels will crash early and often with larger
metadata blocks, this adds an incompat bit so that older kernels
can't mount it.
This also does away with different blocksizes for nodes and leaves.
You get a single block size for all tree blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
progress but aims to handle most errors other than internal logic
errors and ENOMEM more gracefully.
This iteration prevents most crashes but can run into lockups with
the page lock on occasion when the timing "works out."
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
set_extent_bit can do exclusive locking but only when called by lock_extent*,
Drop the exclusive bits argument except when called by lock_extent.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
lock_extent and unlock_extent are always called with GFP_NOFS, drop the
argument and use GFP_NOFS consistently.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
This pushes failures from the submit_bio_hook callbacks,
btrfs_submit_bio_hook and btree_submit_bio_hook into the callers, including
callers of submit_one_bio where it catches the failures with BUG_ON.
It also pushes up through the ->readpage_io_failed_hook to
end_bio_extent_writepage where the error is already caught with BUG_ON.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
In submit_extent_page, there's a visually noisy if statement that, in
the midst of other conditions, does the tree dependency for tree->ops
and tree->ops->merge_bio_hook before calling it, and then another
condition afterwards. If an error is returned from merge_bio_hook,
there's no way to catch it. It's considered a routine "1" return
value instead of a failure.
This patch factors out the dependency check into a new local merge_bio
routine and BUG's on an error. The if statement is less noisy as a side-
effect.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
There is only one caller of clear_extent_bit that checks the return value
and it only checks if it's negative. Since there are no users of the
returned bits functionality of clear_extent_bit, stop returning it
and avoid complicating error handling.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
The *_state functions can only return 0 or -EEXIST. This patch addresses
the cases where those functions returning -EEXIST represent a locking
failure. It handles them by panicking with an appropriate error message.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
If btrfs reads a block and finds a parent transid mismatch, it clears
the uptodate flags on the extent buffer, and the pages inside it. But
we only clear the uptodate bits in the state tree if the block straddles
more than one page.
This is from an old optimization from to reduce contention on the extent
state tree. But it is buggy because the code that retries a read from
a different copy of the block is going to find the uptodate state bits
set and skip the IO.
The end result of the bug is that we'll never actually read the good
copy (if there is one).
The fix here is to always clear the uptodate state bits, which is safe
because this code is only called when the parent transid fails.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Clearing a range's bits is different with setting them, since we don't
need to touch them when states do not contain bits we want.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
We encountered an issue that was easily observable on s/390 systems but
could really happen anywhere. The timing just seemed to hit reliably
on s/390 with limited memory.
The gist is that when an unexpected set_page_dirty() happened, we'd
run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
properly set up for delalloc.
This patch does the following:
- Performs the missing delalloc in the fixup worker
- Allow the start hook to return -EBUSY which informs __extent_writepage
that it should mark the page skipped and not to redirty it. This is
required since the fixup worker can fail with -ENOSPC and the page
will have already been redirtied. That causes an Oops in
drop_outstanding_extents later. Retrying the fixup worker could
lead to an infinite loop. Deferring the page redirty also saves us
some cycles since the page would be stuck in a resubmit-redirty loop
until the fixup worker completes. It's not harmful, just wasteful.
- If the fixup worker fails, we mark the page and mapping as errored,
and end the writeback, similar to what we would do had the page
actually been submitted to writeback.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
A user has encountered a NULL pointer kernel oops in btrfs when
encountering media errors. The problem has been identified
as an unhandled NULL pointer returned from find_get_page().
This modification simply checks for a NULL page, and returns
with an error if found (the extent_range_uptodate() function
returns 1 on errors).
After testing this patch, the user reported that the error with
the NULL pointer oops was solved. However, there is still a
remaining problem with a thread becoming stuck in
wait_on_page_locked(page) in the read_extent_buffer_pages(...)
function in extent_io.c
for (i = start_i; i < num_pages; i++) {
page = extent_buffer_page(eb, i);
wait_on_page_locked(page);
if (!PageUptodate(page))
ret = -EIO;
}
This patch leaves the issue with the locked page yet to be resolved.
Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds the possibilty to read-lock an extent even if it is already
write-locked from the same thread. btrfs_find_all_roots() needs this
capability.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
This is the last part of the patch series. It modifies the btrfs
code to use the integrity check module if configured to do so
with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
the only effective change is that code is added that handles the
mount option to activate the integrity check. If the mount option is
set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
complains in the log and the mount fails with EINVAL.
Add the mount option to activate the usage of the integrity check
code.
Add invocation of btrfs integrity check code init and cleanup
function on mount and umount, respectively.
Add hook to call btrfs integrity check code version of
submit_bh/submit_bio.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Drop spin lock in convert_extent_bit() when memory alloc fails,
otherwise, it will be a deadlock.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commit 4a54c8c16 introduced raid-repair, killing the individual
readpage_io_failed_hook entries from inode.c and disk-io.c. Commit
4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to
disk-io.c.
The raid-repair commit had logic to disable raid-repair, if
readpage_io_failed_hook is set. Thus, the readahead commit effectively
disabled raid-repair for meta data.
This commit changes the logic to always attempt raid-repair when needed and
call the readpage_io_failed_hook in case raid-repair fails. This is much
more straight forward and should have been like that from the beginning.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
else that calls btrfs_get_extent while running xfstests 13 in a loop. This is
because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
which will end up adding mappings that are not sectorsize aligned, which will
cause problems in some cases for subsequent calls to btrfs_get_extent for
similar areas that are sectorsize aligned. With this patch I ran xfstests 13 in
a loop for a couple of hours and didn't hit the problem that I could previously
hit in at most 20 minutes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
My previous patch introduced some u64 for failed_mirror variables, this one
makes it consistent again.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch casts to unsigned long before casting to a pointer and fixes
the following warnings:
fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Failure testing was tripping up over stale PageError bits in
metadata pages. If we have an io error on a block, and later on
end up reusing it, nobody ever clears PageError on those pages.
During commit, we'll find PageError and think we had trouble writing
the block, which will lead to aborts and other problems.
This changes clean_tree_block and the btrfs writepage code to
clear the PageError bit. In both cases we're either completely
done with the page or the page has good stuff and the error bit
is no longer valid.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
write_cache_pages tries to build up a large bio to stuff down the pipe.
But if it needs to wait for a page lock, it needs to make sure and send
down any pending writes so we don't deadlock with anyone who has the
page lock and is waiting for writeback of things inside the bio.
Dave Sterba triggered this as a deadlock between the autodefrag code and
the extent write_cache_pages
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If I have a range where I know a certain bit is and I want to set it to another
bit the only option I have is to call set and then clear bit, which will result
in 2 tree searches. This is inefficient, so introduce convert_extent_bit which
will go through and set the bit I want and clear the old bit I don't want.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We always look for delalloc bytes in our io_tree so we can fill in delalloc.
This is fine in most cases, but if we're writing out the btree_inode this is
just a superfluous tree search on the io_tree, and if we have a lot of metadata
dirty this could be an expensive check. So instead check to see if our io_tree
has a ->fill_delalloc op, and if not don't even bother doing the lookup.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This adds the hooks needed for readahead. In the readpage_end_io_hook,
the extent state is checked for the EXTENT_READAHEAD flag. Only in this
case the readahead hook is called, to keep the impact on non-ra as low
as possible.
Additionally, a hook for a failed IO is added, otherwise readahead would
wait indefinitely for the extent to finish.
Changes for v2:
- eliminate race condition
Signed-off-by: Arne Jansen <sensille@gmx.net>
read_extent_buffer_pages currently has two modes, either trigger a read
without waiting for anything, or wait for the I/O to finish. The former
also bails when it's unable to lock the page. This patch now adds an
additional parameter to allow it to block on page lock, but don't wait
for completion.
Changes v5:
- merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
WAIT_PAGE_LOCK
Change v6:
- fix bug introduced in v5
Signed-off-by: Arne Jansen <sensille@gmx.net>
The raid-retry code in inode.c can be generalized so that it works for
metadata as well. Thus, this patch moves it to extent_io.c and makes the
raid-retry code a raid-repair code.
Repair works that way: Whenever a read error occurs and we have more
mirrors to try, note the failed mirror, and retry another. If we find a
good one, check if we did note a failure earlier and if so, do not allow
the read to complete until after the bad sector was written with the good
data we just fetched. As we have the extent locked while reading, no one
can change the data in between.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
Btrfs: don't call writepages from within write_full_page
Btrfs: Remove unused variable 'last_index' in file.c
Btrfs: clean up for find_first_extent_bit()
Btrfs: clean up for wait_extent_bit()
Btrfs: clean up for insert_state()
Btrfs: remove unused members from struct extent_state
Btrfs: clean up code for merging extent maps
Btrfs: clean up code for extent_map lookup
Btrfs: clean up search_extent_mapping()
Btrfs: remove redundant code for dir item lookup
Btrfs: make acl functions really no-op if acl is not enabled
Btrfs: remove remaining ref-cache code
Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
Btrfs: use wait_event()
Btrfs: check the nodatasum flag when writing compressed files
Btrfs: copy string correctly in INO_LOOKUP ioctl
Btrfs: don't print the leaf if we had an error
btrfs: make btrfs_set_root_node void
Btrfs: fix oops while writing data to SSD partitions
Btrfs: Protect the readonly flag of block group
...
Fix up trivial conflicts (due to acl and writeback cleanups) in
- fs/btrfs/acl.c
- fs/btrfs/ctree.h
- fs/btrfs/extent_io.c
When doing a writepage we call writepages to try and write out any other dirty
pages in the area. This could cause problems where we commit a transaction and
then have somebody else dirtying metadata in the area as we could end up writing
out a lot more than we care about, which could cause latency on anybody who is
waiting for the transaction to completely finish committing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_first_extent_bit() and find_first_extent_bit_state() share
most of the code, and we can just make the former call the latter.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can just use cond_resched_lock().
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The set/clear bit and the extent split/merge hooks only ever return 0.
Changing them to return void simplifies the error handling cases later.
This patch changes the hook prototypes, the single implementation of each,
and the functions that call them to return void instead.
Since all four of these hooks execute under a spinlock, they're necessarily
simple.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
Btrfs: use the commit_root for reading free_space_inode crcs
Btrfs: reduce extent_state lock contention for metadata
Btrfs: remove lockdep magic from btrfs_next_leaf
Btrfs: make a lockdep class for each root
Btrfs: switch the btrfs tree locks to reader/writer
Btrfs: fix deadlock when throttling transactions
Btrfs: stop using highmem for extent_buffers
Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
Btrfs: tag pages for writeback in sync
Btrfs: fix enospc problems with delalloc
Btrfs: don't flush delalloc arbitrarily
Btrfs: use find_or_create_page instead of grab_cache_page
Btrfs: use a worker thread to do caching
Btrfs: fix how we merge extent states and deal with cached states
Btrfs: use the normal checksumming infrastructure for free space cache
Btrfs: serialize flushers in reserve_metadata_bytes
Btrfs: do transaction space reservation before joining the transaction
Btrfs: try to only do one btrfs_search_slot in do_setxattr
For metadata buffers that don't straddle pages (all of them), btrfs
can safely use the page uptodate bits and extent_buffer uptodate bit
instead of needing to use the extent_state tree.
This greatly reduces contention on the state tree lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs metadata btree is the source of significant
lock contention, especially in the root node. This
commit changes our locking to use a reader/writer
lock.
The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking. Atomics
count the number of blocking readers or writers at any
given time.
It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.
In read heavy workloads this is dramatically faster. In write
heavy workloads we're still faster because of less contention
on the root node lock.
We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.
The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.
This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everybody else does this, we need to do it too. If we're syncing, we need to
tag the pages we're going to write for writeback so we don't end up writing the
same stuff over and over again if somebody is constantly redirtying our file.
This will keep us from having latencies with heavy sync workloads. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
First, we can sometimes free the state we're merging, which means anybody who
calls merge_state() may have the state it passed in free'ed. This is
problematic because we could end up caching the state, which makes caching
useless as the state will no longer be part of the tree. So instead of free'ing
the state we passed into merge_state(), set it's end to the other->end and free
the other state. This way we are sure to cache the correct state. Also because
we can merge states together, instead of only using the cache'd state if it's
start == the start we are looking for, go ahead and use it if the start we are
looking for is within the range of the cached state. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.
struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.
It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.
It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.
Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
btrfs: fix uninitialized variable warning
btrfs: add helper for fs_info->closing
Btrfs: add mount -o inode_cache
btrfs: scrub: add explicit plugging
btrfs: use btrfs_ino to access inode number
Btrfs: don't save the inode cache if we are deleting this root
btrfs: false BUG_ON when degraded
Btrfs: don't save the inode cache in non-FS roots
Btrfs: make sure we don't overflow the free space cache crc page
Btrfs: fix uninit variable in the delayed inode code
btrfs: scrub: don't reuse bios and pages
Btrfs: leave spinning on lookup and map the leaf
Btrfs: check for duplicate entries in the free space cache
Btrfs: don't try to allocate from a block group that doesn't have enough space
Btrfs: don't always do readahead
Btrfs: try not to sleep as much when doing slow caching
Btrfs: kill BTRFS_I(inode)->block_group
Btrfs: don't look at the extent buffer level 3 times in a row
Btrfs: map the node block when looking for readahead targets
Btrfs: set range_start to the right start in count_range_bits
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (58 commits)
Btrfs: use the device_list_mutex during write_dev_supers
Btrfs: setup free ino caching in a more asynchronous way
btrfs scrub: don't coalesce pages that are logically discontiguous
Btrfs: return -ENOMEM in clear_extent_bit
Btrfs: add mount -o auto_defrag
Btrfs: using rcu lock in the reader side of devices list
Btrfs: drop unnecessary device lock
Btrfs: fix the race between remove dev and alloc chunk
Btrfs: fix the race between reading and updating devices
Btrfs: fix bh leak on __btrfs_open_devices path
Btrfs: fix unsafe usage of merge_state
Btrfs: allocate extent state and check the result properly
fs/btrfs: Add missing btrfs_free_path
Btrfs: check return value of btrfs_inc_extent_ref()
Btrfs: return error to caller if read_one_inode() fails
Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
Btrfs: return error code to caller when btrfs_del_item fails
Btrfs: return error code to caller when btrfs_previous_item fails
btrfs: fix typo 'testeing' -> 'testing'
btrfs: typo: 'btrfS' -> 'btrfs'
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
xen: cleancache shim to Xen Transcendent Memory
ocfs2: add cleancache support
ext4: add cleancache support
btrfs: add cleancache support
ext3: add cleancache support
mm/fs: add hooks to support cleancache
mm: cleancache core ops functions and config
fs: add field to superblock to support cleancache
mm/fs: cleancache documentation
Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
This sixth patch of eight in this cleancache series "opts-in"
cleancache for btrfs. Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted. Btrfs uses its own readpage
which must be hooked, but all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs hook
which must be called on unmount.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
merge_state can free the current state if it can be merged with the next node,
but in set_extent_bit(), after merge_state, we still use the current extent to
get the next node and cache it into cached_state
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It doesn't allocate extent_state and check the result properly:
- in set_extent_bit, it doesn't allocate extent_state if the path is not
allowed wait
- in clear_extent_bit, it doesn't check the result after atomic-ly allocate,
we trigger BUG_ON() if it's fail
- if allocate fail, we trigger BUG_ON instead of returning -ENOMEM since
the return value of clear_extent_bit() is ignored by many callers
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In count_range_bits we are adjusting total_bytes based on the range we are
searching for, but we don't adjust the range start according to the range we are
searching for, which makes for weird results. For example, if the range
[0-8192]
is set DELALLOC, but I search for 4096-8192, I will get back 4096 for the number
of bytes found, but the range_start will be 0, which makes it look like the
range is [0-4096]. So instead set range_start = max(cur_start, state->start).
This makes everything come out right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.
So this fixes things up a bit, using
grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.
There are more of them around (mostly network drivers), but this gets
many core ones.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>