Currently device state is being managed by each individual int
variable such as struct btrfs_device::in_fs_metadata. Instead of
that declare device state BTRFS_DEV_STATE_IN_FS_METADATA and use
the bit operations.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently device state is being managed by each individual int
variable such as struct btrfs_device::writeable. Instead of that
declare device state BTRFS_DEV_STATE_WRITEABLE and use the
bit operations.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ whitespace adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
The maximum size of a checksum buffer is known, BTRFS_CSUM_SIZE, and we
don't have to allocate it dynamically. This code path is not used at all
as we have only the crc32c and use an on-stack buffer already.
Signed-off-by: David Sterba <dsterba@suse.com>
We take the fs_devices::device_list_mutex mutex in write_all_supers
which will prevent any add/del changes to the device list. Therefore we
don't need to use the RCU variant list_for_each_entry_rcu in any of the
called functions.
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_create_tree() will unconditionally generate UUID for any root.
So for quota tree and data reloc tree created by kernel, they will have
unique UUIDs.
However UUID in root item is only referred by UUID tree, which only
records UUID for fs trees. This makes unique UUIDs for quota/data reloc
tree meaningless.
Leave the UUID as zero for non-fs tree, making btrfs-debug-tree output
less confusing.
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add injectable error types for each error-injectable function.
One motivation of error injection test is to find software flaws,
mistakes or mis-handlings of expectable errors. If we find such
flaws by the test, that is a program bug, so we need to fix it.
But if the tester miss input the error (e.g. just return success
code without processing anything), it causes unexpected behavior
even if the caller is correctly programmed to handle any errors.
That is not what we want to test by error injection.
To clarify what type of errors the caller must expect for each
injectable function, this introduces injectable error types:
- EI_ETYPE_NULL : means the function will return NULL if it
fails. No ERR_PTR, just a NULL.
- EI_ETYPE_ERRNO : means the function will return -ERRNO
if it fails.
- EI_ETYPE_ERRNO_NULL : means the function will return -ERRNO
(ERR_PTR) or NULL.
ALLOW_ERROR_INJECTION() macro is expanded to get one of
NULL, ERRNO, ERRNO_NULL to record the error type for
each function. e.g.
ALLOW_ERROR_INJECTION(open_ctree, ERRNO)
This error types are shown in debugfs as below.
====
/ # cat /sys/kernel/debug/error_injection/list
open_ctree [btrfs] ERRNO
io_ctl_init [btrfs] ERRNO
====
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Since error-injection framework is not limited to be used
by kprobes, nor bpf. Other kernel subsystems can use it
freely for checking safeness of error-injection, e.g.
livepatch, ftrace etc.
So this separate error-injection framework from kprobes.
Some differences has been made:
- "kprobe" word is removed from any APIs/structures.
- BPF_ALLOW_ERROR_INJECTION() is renamed to
ALLOW_ERROR_INJECTION() since it is not limited for BPF too.
- CONFIG_FUNCTION_ERROR_INJECTION is the config item of this
feature. It is automatically enabled if the arch supports
error injection feature for kprobe or ftrace etc.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2017-12-18
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow arbitrary function calls from one BPF function to another BPF function.
As of today when writing BPF programs, __always_inline had to be used in
the BPF C programs for all functions, unnecessarily causing LLVM to inflate
code size. Handle this more naturally with support for BPF to BPF calls
such that this __always_inline restriction can be overcome. As a result,
it allows for better optimized code and finally enables to introduce core
BPF libraries in the future that can be reused out of different projects.
x86 and arm64 JIT support was added as well, from Alexei.
2) Add infrastructure for tagging functions as error injectable and allow for
BPF to return arbitrary error values when BPF is attached via kprobes on
those. This way of injecting errors generically eases testing and debugging
without having to recompile or restart the kernel. Tags for opting-in for
this facility are added with BPF_ALLOW_ERROR_INJECTION(), from Josef.
3) For BPF offload via nfp JIT, add support for bpf_xdp_adjust_head() helper
call for XDP programs. First part of this work adds handling of BPF
capabilities included in the firmware, and the later patches add support
to the nfp verifier part and JIT as well as some small optimizations,
from Jakub.
4) The bpftool now also gets support for basic cgroup BPF operations such
as attaching, detaching and listing current BPF programs. As a requirement
for the attach part, bpftool can now also load object files through
'bpftool prog load'. This reuses libbpf which we have in the kernel tree
as well. bpftool-cgroup man page is added along with it, from Roman.
5) Back then commit e87c6bc385 ("bpf: permit multiple bpf attachments for
a single perf event") added support for attaching multiple BPF programs
to a single perf event. Given they are configured through perf's ioctl()
interface, the interface has been extended with a PERF_EVENT_IOC_QUERY_BPF
command in this work in order to return an array of one or multiple BPF
prog ids that are currently attached, from Yonghong.
6) Various minor fixes and cleanups to the bpftool's Makefile as well
as a new 'uninstall' and 'doc-uninstall' target for removing bpftool
itself or prior installed documentation related to it, from Quentin.
7) Add CONFIG_CGROUP_BPF=y to the BPF kernel selftest config file which is
required for the test_dev_cgroup test case to run, from Naresh.
8) Fix reporting of XDP prog_flags for nfp driver, from Jakub.
9) Fix libbpf's exit code from the Makefile when libelf was not found in
the system, also from Jakub.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows us to do error injection with BPF for open_ctree.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
I was seeing disk flushes still happening when I mounted a Btrfs
filesystem with nobarrier for testing. This is because we use FUA to
write out the first super block, and on devices without FUA support, the
block layer translates FUA to a flush. Even on devices supporting true
FUA, using FUA when we asked for no barriers is surprising.
Fixes: 387125fc72 ("Btrfs: fix barrier flushes")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
instantly cause kernel panic like:
------
...
assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
...
Call Trace:
btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
setup_items_for_insert+0x385/0x650 [btrfs]
__btrfs_drop_extents+0x129a/0x1870 [btrfs]
...
-----
[Cause]
Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.
However quite some btrfs_mark_buffer_dirty() callers(*) don't really
initialize its item data but only initialize its item pointers, leaving
item data uninitialized.
This makes tree-checker catch uninitialized data as error, causing
such panic.
*: These callers include but not limited to
setup_items_for_insert()
btrfs_split_item()
btrfs_expand_item()
[Fix]
Add a new parameter @check_item_data to btrfs_check_leaf().
With @check_item_data set to false, item data check will be skipped and
fallback to old btrfs_check_leaf() behavior.
So we can still get early warning if we screw up item pointers, and
avoid false panic.
Cc: Filipe Manana <fdmanana@gmail.com>
Reported-by: Lakshmipathi.G <lakshmipathi.g@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we get a significant amount of delayed refs for a single block (think
modifying multiple snapshots) we can end up spending an ungodly amount
of time looping through all of the entries trying to see if they can be
merged. This is because we only add them to a list, so we have O(2n)
for every ref head. This doesn't make any sense as we likely have refs
for different roots, and so they cannot be merged. Tracking in a tree
will allow us to break as soon as we hit an entry that doesn't match,
making our worst case O(n).
With this we can also merge entries more easily. Before we had to hope
that matching refs were on the ends of our list, but with the tree we
can search down to exact matches and merge them at insert time.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The way we handle delalloc metadata reservations has gotten
progressively more complicated over the years. There is so much cruft
and weirdness around keeping the reserved count and outstanding counters
consistent and handling the error cases that it's impossible to
understand.
Fix this by making the delalloc block rsv per-inode. This way we can
calculate the actual size of the outstanding metadata reservations every
time we make a change, and then reserve the delta based on that amount.
This greatly simplifies the code everywhere, and makes the error
handling in btrfs_delalloc_reserve_metadata far less terrifying.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs' code uses a mix of opencoded sizes and defines from sizes.h.
Let's unifiy the code base to always use the symbolic constants. No functional
changes
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is just excessive information in the ref_head, and makes the code
complicated. It is a relic from when we had the heads and the refs in
the same tree, which is no longer the case. With this removal I've
cleaned up a bunch of the cruft around this old assumption as well.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were having corruption issues that were tied back to problems with
the extent tree. In order to track them down I built this tool to try
and find the culprit, which was pretty successful. If you compile with
this tool on it will live verify every ref update that the fs makes and
make sure it is consistent and valid. I've run this through with
xfstests and haven't gotten any false positives. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update error messages, add fixup from Dan Carpenter to handle errors
of read_tree_block ]
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have the combo of flushing twice, which can make sure IO
have started since the second flush will wait for page lock which
won't be unlocked unless setting page writeback and queuing ordered
extents, we don't need %async_submit_draining, %async_delalloc_pages
and %nr_async_submits to tell whether the IO has actually started.
Moreover, all the flushers in use are followed by functions that wait
for ordered extents to complete, so %nr_async_submits, which tracks
whether bio's async submit has made progress, doesn't really make
sense.
However, %async_delalloc_pages is still required by shrink_delalloc()
as that function doesn't flush twice in the normal case (just issues a
writeback with WB_REASON_FS_FREE_SPACE).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was intended to congest higher layers to not send bios, but as
1) the congested bit has been taken by writeback
Async bios come from buffered writes and DIO writes.
For DIO writes, we want to submit them ASAP, while for buffered writes,
writeback uses balance_dirty_pages() to throttle how much dirty pages we
can have.
2) and no one is waiting for %nr_async_bios down to zero,
Historically, it was introduced along with changes which let
checksumming workload spread accross different cpus. And at that time,
pdflush was used instead of per-bdi flushing, perhaps pdflush did not
have the necessary information for writeback to do throttling.
We can safely remove them now.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
[ additional explanation from mails, removed unused variable 'limit' ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's no doubt the comprehensive tree block checker will become larger,
so moving them into their own files is quite reasonable.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
EXTENT_CSUM checker is a relatively easy one, only needs to check:
1) Objectid
Fixed to BTRFS_EXTENT_CSUM_OBJECTID
2) Key offset alignment
Must be aligned to sectorsize
3) Item size alignedment
Must be aligned to csum size
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add extra checks for item with EXTENT_DATA type. This checks the
following thing:
0) Key offset
All key offsets must be aligned to sectorsize.
Inline extent must have 0 for key offset.
1) Item size
Uncompressed inline file extent size must match item size.
(Compressed inline file extent has no information about its on-disk size.)
Regular/preallocated file extent size must be a fixed value.
2) Every member of regular file extent item
Including alignment for bytenr and offset, possible value for
compression/encryption/type.
3) Type/compression/encode must be one of the valid values.
This should be the most comprehensive and strict check in the context
of btrfs_item for EXTENT_DATA.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch to BTRFS_FILE_EXTENT_TYPES, similar to what
BTRFS_COMPRESS_TYPES does ]
Signed-off-by: David Sterba <dsterba@suse.com>
Function check_leaf() checks if any item pointer points outside of the
leaf, but it doesn't check if the pointer overlaps with the item itself.
Normally only the last item may be the victim, but adding such check is
never a bad idea anyway.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current check_leaf() function does a good job checking key order and
item offset/size.
However it only checks from slot 0 to the last but one slot, this is
good but makes later expansion hard.
So this refactoring iterates from slot 0 to the last slot.
For key comparison, it uses a key with all 0 as initial key, so all
valid keys should be larger than that.
And for item size/offset checks, it compares current item end with
previous item offset.
For slot 0, use leaf end as a special case.
This makes later item/key offset checks and item size checks easier to
be implemented.
Also, makes check_leaf() to return -EUCLEAN other than -EIO to indicate
error.
Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.
We can remove this bio_flags because in order to write dirty blocks,
log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers, therefore, every write goes in a
synchronous way, so does checksuming.
Please also note that, bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, ie.
1) while log tree is flushing its dirty blocks via
btrfs_write_marked_extents(), in which %sync_writers is increased
by one.
2) in the meantime, some writeback operations may happen upon btrfs's
metadata inode, so these writes go synchronously, too.
However, AFAICS, the overhead is not a big one while the win is that
we unify the two places that needs synchronous way and remove a
special hack/flag.
This removes the bio_flags related stuff for writing log-tree.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have started plug in btrfs_write_and_wait_marked_extents() but the
generated IOs actually go to device's schedule IO list where the work
is doing in another task, thus the started plug doesn't make any
sense.
And since we wait for IOs immediately after writing meta blocks, it's
the same case as writing log tree, doing sync submit can merge more
IOs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We didn't copy fsid to struct super_block.s_uuid so Overlay disables
index feature with btrfs as the lower FS.
kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off.
Fix this by publishing the fsid through struct super_block.s_uuid.
[ dsterba: I think that setting s_uuid is the last missing bit. Overlay
needs the file handle encoding support from the lower filesystem, which
is supported. Filling the whole filesystem id is correct, the subvolume
id is encoded in the file handle buffer from inside btrfs_encode_fh. ]
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from David Sterba:
"We've collected a bunch of isolated fixes, for crashes, user-visible
behaviour or missing bits from other subsystem cleanups from the past.
The overall number is not small but I was not able to make it
significantly smaller. Most of the patches are supposed to go to
stable"
* 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: log csums for all modified extents
Btrfs: fix unexpected result when dio reading corrupted blocks
btrfs: Report error on removing qgroup if del_qgroup_item fails
Btrfs: skip checksum when reading compressed data if some IO have failed
Btrfs: fix kernel oops while reading compressed data
Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
Btrfs: do not backup tree roots when fsync
btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
btrfs: propagate error to btrfs_cmp_data_prepare caller
btrfs: prevent to set invalid default subvolid
Btrfs: send: fix error number for unknown inode types
btrfs: fix NULL pointer dereference from free_reloc_roots()
btrfs: finish ordered extent cleaning if no progress is found
btrfs: clear ordered flag on cleaning up ordered extents
Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
Btrfs: do not reset bio->bi_ops while writing bio
Btrfs: use the new helper wbc_to_write_flags
It doesn't make sense to backup tree roots when doing fsync, since
during fsync those tree roots have not been consistent on disk.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull mount flag updates from Al Viro:
"Another chunk of fmount preparations from dhowells; only trivial
conflicts for that part. It separates MS_... bits (very grotty
mount(2) ABI) from the struct super_block ->s_flags (kernel-internal,
only a small subset of MS_... stuff).
This does *not* convert the filesystems to new constants; only the
infrastructure is done here. The next step in that series is where the
conflicts would be; that's the conversion of filesystems. It's purely
mechanical and it's better done after the merge, so if you could run
something like
list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$')
sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \
-e 's/\<MS_NOSUID\>/SB_NOSUID/g' \
-e 's/\<MS_NODEV\>/SB_NODEV/g' \
-e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \
-e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \
-e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \
-e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \
-e 's/\<MS_NOATIME\>/SB_NOATIME/g' \
-e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \
-e 's/\<MS_SILENT\>/SB_SILENT/g' \
-e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \
-e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \
-e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \
-e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \
$list
and commit it with something along the lines of 'convert filesystems
away from use of MS_... constants' as commit message, it would save a
quite a bit of headache next cycle"
* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Differentiate mount flags (MS_*) from internal superblock flags
VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)
vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
Pull zstd support from Chris Mason:
"Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while. After talking with Dave Sterba, Herbert
and Phillip, we decided to send the whole thing in as one pull
request.
zstd is a big win in speed over zlib and in compression ratio over
lzo, and the compression team here at FB has gotten great results
using it in production. Nick will continue to update the kernel side
with new improvements from the open source zstd userland code.
Nick has a number of benchmarks for the main zstd code in his lib/zstd
commit:
I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB
of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel
Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using
`silesia.tar` [3], which is 211,988,480 B large. Run the following
commands for the benchmark:
sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test
The time is reported by the time of the userland `cp`.
The MB/s is computed with
1,536,217,008 B / time(buffer size, hash)
which includes the time to copy from userland.
The Adjusted MB/s is computed with
1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).
The memory reported is the amount of memory the compressor
requests.
| Method | Size (B) | Time (s) | Ratio | MB/s | Adj MB/s | Mem (MB) |
|----------|----------|----------|-------|---------|----------|----------|
| none | 11988480 | 0.100 | 1 | 2119.88 | - | - |
| zstd -1 | 73645762 | 1.044 | 2.878 | 203.05 | 224.56 | 1.23 |
| zstd -3 | 66988878 | 1.761 | 3.165 | 120.38 | 127.63 | 2.47 |
| zstd -5 | 65001259 | 2.563 | 3.261 | 82.71 | 86.07 | 2.86 |
| zstd -10 | 60165346 | 13.242 | 3.523 | 16.01 | 16.13 | 13.22 |
| zstd -15 | 58009756 | 47.601 | 3.654 | 4.45 | 4.46 | 21.61 |
| zstd -19 | 54014593 | 102.835 | 3.925 | 2.06 | 2.06 | 60.15 |
| zlib -1 | 77260026 | 2.895 | 2.744 | 73.23 | 75.85 | 0.27 |
| zlib -3 | 72972206 | 4.116 | 2.905 | 51.50 | 52.79 | 0.27 |
| zlib -6 | 68190360 | 9.633 | 3.109 | 22.01 | 22.24 | 0.27 |
| zlib -9 | 67613382 | 22.554 | 3.135 | 9.40 | 9.44 | 0.27 |
I benchmarked zstd decompression using the same method on the same
machine. The benchmark file is located in the upstream zstd repo
under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The
memory reported is the amount of memory required to decompress
data compressed with the given compression level. If you know the
maximum size of your input, you can reduce the memory usage of
decompression irrespective of the compression level.
| Method | Time (s) | MB/s | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none | 0.025 | 8479.54 | - | - |
| zstd -1 | 0.358 | 592.15 | 636.60 | 0.84 |
| zstd -3 | 0.396 | 535.32 | 571.40 | 1.46 |
| zstd -5 | 0.396 | 535.32 | 571.40 | 1.46 |
| zstd -10 | 0.374 | 566.81 | 607.42 | 2.51 |
| zstd -15 | 0.379 | 559.34 | 598.84 | 4.61 |
| zstd -19 | 0.412 | 514.54 | 547.77 | 8.80 |
| zlib -1 | 0.940 | 225.52 | 231.68 | 0.04 |
| zlib -3 | 0.883 | 240.08 | 247.07 | 0.04 |
| zlib -6 | 0.844 | 251.17 | 258.84 | 0.04 |
| zlib -9 | 0.837 | 253.27 | 287.64 | 0.04 |
I ran a long series of tests and benchmarks on the btrfs side and the
gains are very similar to the core benchmarks Nick ran"
* 'zstd-minimal' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
squashfs: Add zstd support
btrfs: Add zstd support
lib: Add zstd modules
lib: Add xxhash module
Pull btrfs updates from David Sterba:
"The changes range through all types: cleanups, core chagnes, sanity
checks, fixes, other user visible changes, detailed list below:
- deprecated: user transaction ioctl
- mount option ssd does not change allocation alignments
- degraded read-write mount is allowed if all the raid profile
constraints are met, now based on more accurate check
- defrag: do not reset compression afterwards; the NOCOMPRESS flag
can be now overriden by defrag
- prep work for better extent reference tracking (related to the
qgroup slowness with balance)
- prep work for compression heuristics
- memory allocation reductions (may help latencies on a loaded
system)
- better accounting for io waiting states
- error handling improvements (removed BUGs)
- added more sanity checks for shared refs
- fix readdir vs pagefault deadlock under some circumstances
- fix for 'no-hole' mode, certain combination of compressed and
inline extents
- send: fix emission of invalid clone operations
- fixup file mode if setting acls fail
- more fixes from fuzzing
- oher cleanups"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
btrfs: submit superblock io with REQ_META and REQ_PRIO
btrfs: remove unnecessary memory barrier in btrfs_direct_IO
btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
btrfs: pass fs_info to btrfs_del_root instead of tree_root
Btrfs: add one more sanity check for shared ref type
Btrfs: remove BUG_ON in __add_tree_block
Btrfs: remove BUG() in add_data_reference
Btrfs: remove BUG() in print_extent_item
Btrfs: remove BUG() in btrfs_extent_inline_ref_size
Btrfs: convert to use btrfs_get_extent_inline_ref_type
Btrfs: add a helper to retrive extent inline ref type
btrfs: scrub: simplify scrub worker initialization
btrfs: scrub: clean up division in scrub_find_csum
btrfs: scrub: clean up division in __scrub_mark_bitmap
btrfs: scrub: use bool for flush_all_writes
btrfs: preserve i_mode if __btrfs_set_acl() fails
btrfs: Remove extraneous chunk_objectid variable
btrfs: Remove chunk_objectid argument from btrfs_make_block_group
btrfs: Remove extra parentheses from condition in copy_items()
...
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
This fixes several instances of blk_status_t and bare errno ints being
mixed up, some of which are real bugs.
In the normal case, 0 matches BLK_STS_OK, so we don't observe any
effects of the missing conversion, but in case of errors or passes
through the repair/retry paths, the errors get mixed up.
The changes were identified using 'sparse', we don't have reports of the
buggy behaviour.
Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The superblock is also metadata of the filesystem so the relevant IO
should be tagged as such. We also tag it as high priority, as it's the
last block committed for metadata from a given transaction. Any delays
would effectively block the whole transaction, also blocking any other
operation holding the device_list_mutex.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch provides a band aid to improve the 'out of the box'
behaviour of btrfs for disks that are detected as being an ssd. In a
general purpose mixed workload scenario, the current ssd mode causes
overallocation of available raw disk space for data, while leaving
behind increasing amounts of unused fragmented free space. This
situation leads to early ENOSPC problems which are harming user
experience and adoption of btrfs as a general purpose filesystem.
This patch modifies the data extent allocation behaviour of the ssd mode
to make it behave identical to nossd mode. The metadata behaviour and
additional ssd_spread option stay untouched so far.
Recommendations for future development are to reconsider the current
oversimplified nossd / ssd distinction and the broken detection
mechanism based on the rotational attribute in sysfs and provide
experienced users with a more flexible way to choose allocator behaviour
for data and metadata, optimized for certain use cases, while keeping
sane 'out of the box' default settings. The internals of the current
btrfs code have more potential than what currently gets exposed to the
user to choose from.
The SSD story...
In the first year of btrfs development, around early 2008, btrfs
gained a mount option which enables specific functionality for
filesystems on solid state devices. The first occurance of this
functionality is in commit e18e4809, labeled "Add mount -o ssd, which
includes optimizations for seek free storage".
The effect on allocating free space for doing (data) writes is to
'cluster' writes together, writing them out in contiguous space, as
opposed to a 'tetris' way of putting all separate writes into any free
space fragment that fits (which is what the -o nossd behaviour does).
A somewhat simplified explanation of what happens is that, when for
example, the 'cluster' size is set to 2MiB, when we do some writes, the
data allocator will search for a free space block that is 2MiB big, and
put the writes in there. The ssd mode itself might allow a 2MiB cluster
to be composed of multiple free space extents with some existing data in
between, while the additional ssd_spread mount option kills off this
option and requires fully free space.
The idea behind this is (commit 536ac8ae): "The [...] clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
block. So, effectively this means applying a "locality based algorithm"
and trying to outsmart the actual ssd.
Since then, various changes have been made to the involved code, but the
basic idea is still present, and gets activated whenever the ssd mount
option is active. This also happens by default, when the rotational flag
as seen at /sys/block/<device>/queue/rotational is set to 0.
However, there's a number of problems with this approach.
First, what the optimization is trying to do is outsmart the ssd by
assuming there is a relation between the physical address space of the
block device as seen by btrfs and the actual physical storage of the
ssd, and then adjusting data placement. However, since the introduction
of the Flash Translation Layer (FTL) which is a part of the internal
controller of an ssd, these attempts are futile. The use of good quality
FTL in consumer ssd products might have been limited in 2008, but this
situation has changed drastically soon after that time. Today, even the
flash memory in your automatic cat feeding machine or your grandma's
wheelchair has a full featured one.
Second, the behaviour as described above results in the filesystem being
filled up with badly fragmented free space extents because of relatively
small pieces of space that are freed up by deletes, but not selected
again as part of a 'cluster'. Since the algorithm prefers allocating a
new chunk over going back to tetris mode, the end result is a filesystem
in which all raw space is allocated, but which is composed of
underutilized chunks with a 'shotgun blast' pattern of fragmented free
space. Usually, the next problematic thing that happens is the
filesystem wanting to allocate new space for metadata, which causes the
filesystem to fail in spectacular ways.
Third, the default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with spreading out writes over
the full address space and ignoring freed up space leads to worst case
behaviour in providing information to the ssd itself, since it will
never learn that all the free space left behind is actually free. There
are two ways to let an ssd know previously written data does not have to
be preserved, which are sending explicit signals using discard or
fstrim, or by simply overwriting the space with new data. The worst
case behaviour is the btrfs ssd_spread mount option in combination with
not having discard enabled. It has a side effect of minimizing the reuse
of free space previously written in.
Fourth, the rotational flag in /sys/ does not reliably indicate if the
device is a locally attached ssd. For example, iSCSI or NBD displays as
non-rotational, while a loop device on an ssd shows up as rotational.
The combination of the second and third problem effectively means that
despite all the good intentions, the btrfs ssd mode reliably causes the
ssd hardware and the filesystem structures and performance to be choked
to death. The clickbait version of the title of this story would have
been "Btrfs ssd optimizations considered harmful for ssds".
The current nossd 'tetris' mode (even still without discard) allows a
pattern of overwriting much more previously used space, causing many
more implicit discards to happen because of the overwrite information
the ssd gets. The actual location in the physical address space, as seen
from the point of view of btrfs is irrelevant, because the actual writes
to the low level flash are reordered anyway thanks to the FTL.
Changes made in the code
1. Make ssd mode data allocation identical to tetris mode, like nossd.
2. Adjust and clean up filesystem mount messages so that we can easily
identify if a kernel has this patch applied or not, when providing
support to end users. Also, make better use of the *_and_info helpers to
only trigger messages on actual state changes.
Backporting notes
Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
* First apply commit 951e7966 "btrfs: drop the nossd flag when
remounting with -o ssd", or fixup the differences manually.
* The rest of the conflicts are because of the fs_info refactoring. So,
for example, instead of using fs_info, it's root->fs_info in
extent-tree.c
Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although this bio has no data attached, it will reach this condition
(bio->bi_opf & REQ_PREFLUSH) and then update the flush_gen of dev_state
in __btrfsic_submit_bio. So we should still submit it through integrity
checker. Otherwise, the integrity checker will throw the following warning
when I mount a newly created btrfs filesystem.
[10264.755497] btrfs: attempt to write superblock which references block M @29523968 (sdb1/1111654400/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
[10264.755498] btrfs: attempt to write superblock which references block M @29523968 (sdb1/37912576/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, we
should use the matching constant for the fsid buffer.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The pinned chunks might be left over so we clean them but at this point
of close_ctree, there's noone to race with, the locking can be removed.
Signed-off-by: David Sterba <dsterba@suse.com>
Superblock is read and written using buffer heads, we need to set the
bdev blocksize. The magic constant has been hardcoded in several places,
so replace it with a named constant.
Signed-off-by: David Sterba <dsterba@suse.com>
There are two independent parts, one that writes the superblocks and
another that waits for completion. No functional changes, but cleanups,
reformatting and comment updates.
Signed-off-by: David Sterba <dsterba@suse.com>
As we use per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is of no use.
We can now remove it.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices().
But it can be easily changed to the new per-chunk degradable check
framework.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now use the btrfs_check_rw_degradable() to check if we can mount in the
degraded mode.
With this patch, we can mount in the following case:
# mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
# wipefs -a /dev/sdc
# mount /dev/sdb /mnt/btrfs -o degraded
As the single data chunk is only on sdb, so it's OK to mount as
degraded, as missing one device is OK for RAID1.
But still fail in the following case as expected:
# mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
# wipefs -a /dev/sdb
# mount /dev/sdc /mnt/btrfs -o degraded
As the data chunk is only in sdb, so it's not OK to mount it as
degraded.
Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reported-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add zstd compression and decompression support to BtrFS. zstd at its
fastest level compresses almost as well as zlib, while offering much
faster compression and decompression, approaching lzo speeds.
I benchmarked btrfs with zstd compression against no compression, lzo
compression, and zlib compression. I benchmarked two scenarios. Copying
a set of files to btrfs, and then reading the files. Copying a tarball
to btrfs, extracting it to btrfs, and then reading the extracted files.
After every operation, I call `sync` and include the sync time.
Between every pair of operations I unmount and remount the filesystem
to avoid caching. The benchmark files can be found in the upstream
zstd source repository under
`contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
[1] [2].
I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD.
The first compression benchmark is copying 10 copies of the unzipped
Silesia corpus [3] into a BtrFS filesystem mounted with
`-o compress-force=Method`. The decompression benchmark times how long
it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
measured by comparing the output of `df` and `du`. See the benchmark file
[1] for details. I benchmarked multiple zstd compression levels, although
the patch uses zstd level 1.
| Method | Ratio | Compression MB/s | Decompression speed |
|---------|-------|------------------|---------------------|
| None | 0.99 | 504 | 686 |
| lzo | 1.66 | 398 | 442 |
| zlib | 2.58 | 65 | 241 |
| zstd 1 | 2.57 | 260 | 383 |
| zstd 3 | 2.71 | 174 | 408 |
| zstd 6 | 2.87 | 70 | 398 |
| zstd 9 | 2.92 | 43 | 406 |
| zstd 12 | 2.93 | 21 | 408 |
| zstd 15 | 3.01 | 11 | 354 |
The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
measures the compression ratio, extracts the tar, and deletes the tar.
Then it measures the compression ratio again, and `tar`s the extracted
files into `/dev/null`. See the benchmark file [2] for details.
| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s)| Read (s) |
|--------|-----------|---------------|----------|------------|----------|
| None | 0.97 | 0.78 | 0.981 | 5.501 | 8.807 |
| lzo | 2.06 | 1.38 | 1.631 | 8.458 | 8.585 |
| zlib | 3.40 | 1.86 | 7.750 | 21.544 | 11.744 |
| zstd 1 | 3.57 | 1.85 | 2.579 | 11.479 | 9.389 |
[1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz
zstd source repository: https://github.com/facebook/zstd
Signed-off-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Firstly by applying the following with coccinelle's spatch:
@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)
to effect the conversion to sb_rdonly(sb), then by applying:
@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)
@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)
to remove left over excess bracketage and finally by applying:
@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)
to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull btrfs fixes from David Sterba:
"We've identified and fixed a silent corruption (introduced by code in
the first pull), a fixup after the blk_status_t merge and two fixes to
incremental send that Filipe has been hunting for some time"
* 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix unexpected return value of bio_readpage_error
btrfs: btrfs_create_repair_bio never fails, skip error handling
btrfs: cloned bios must not be iterated by bio_for_each_segment_all
Btrfs: fix write corruption due to bio cloning on raid5/6
Btrfs: incremental send, fix invalid memory access
Btrfs: incremental send, fix invalid path for link commands
We've started using cloned bios more in 4.13, there are some specifics
regarding the iteration. Filipe found [1] that the raid56 iterated a
cloned bio using bio_for_each_segment_all, which is incorrect. The
cloned bios have wrong bi_vcnt and this could lead to silent
corruptions. This patch adds assertions to all remaining
bio_for_each_segment_all cases.
[1] https://patchwork.kernel.org/patch/9838535/
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull percpu updates from Tejun Heo:
"These are the percpu changes for the v4.13-rc1 merge window. There are
a couple visibility related changes - tracepoints and allocator stats
through debugfs, along with __ro_after_init markings and a cosmetic
rename in percpu_counter.
Please note that the simple O(#elements_in_the_chunk) area allocator
used by percpu allocator is again showing scalability issues,
primarily with bpf allocating and freeing large number of counters.
Dennis is working on the replacement allocator and the percpu
allocator will be seeing increased churns in the coming cycles"
* 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: fix static checker warnings in pcpu_destroy_chunk
percpu: fix early calls for spinlock in pcpu_stats
percpu: resolve err may not be initialized in pcpu_alloc
percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
percpu: add tracepoint support for percpu memory
percpu: expose statistics about percpu memory via debugfs
percpu: migrate percpu data structures to internal header
percpu: add missing lockdep_assert_held to func pcpu_free_area
mark most percpu globals as __ro_after_init
Pull btrfs updates from David Sterba:
"The core updates improve error handling (mostly related to bios), with
the usual incremental work on the GFP_NOFS (mis)use removal,
refactoring or cleanups. Except the two top patches, all have been in
for-next for an extensive amount of time.
User visible changes:
- statx support
- quota override tunable
- improved compression thresholds
- obsoleted mount option alloc_start
Core updates:
- bio-related updates:
- faster bio cloning
- no allocation failures
- preallocated flush bios
- more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates
- prep work for btree_inode removal
- dir-item validation
- qgoup fixes and updates
- cleanups:
- removed unused struct members, unused code, refactoring
- argument refactoring (fs_info/root, caller -> callee sink)
- SEARCH_TREE ioctl docs"
* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
btrfs: Remove false alert when fiemap range is smaller than on-disk extent
btrfs: Don't clear SGID when inheriting ACLs
btrfs: fix integer overflow in calc_reclaim_items_nr
btrfs: scrub: fix target device intialization while setting up scrub context
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
btrfs: qgroup: Add quick exit for non-fs extents
Btrfs: rework delayed ref total_bytes_pinned accounting
Btrfs: return old and new total ref mods when adding delayed refs
Btrfs: always account pinned bytes when dropping a tree block ref
Btrfs: update total_bytes_pinned when pinning down extents
Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
btrfs: fix validation of XATTR_ITEM dir items
btrfs: Verify dir_item in iterate_object_props
btrfs: Check name_len before in btrfs_del_root_ref
btrfs: Check name_len before reading btrfs_get_name
...
We should really just wait in wait_dev_flush and let the caller decide
what to do with the error value.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to what submit_bio_wait does, we should account for IO while
waiting for a bio completion. This has marginal visible effects, flush
bio is short-lived.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For devices that support flushing, we allocate a bio, submit, wait for
it and then free it. The bio allocation does not fail so ENOMEM is not a
problem but we still may unnecessarily stress the allocation subsystem.
Instead, we can allocate the bio at the same time we allocate the device
and reuse it each time we need to flush the barriers. The bio is reset
before each use. Reference counting is simplified to just device
allocation (get) and freeing (put).
The bio used to be submitted through the integrity checker which will
find out that bio has no data attached and call submit_bio.
Status of the bio in flight needs to be tracked separately in case the
device caches get switched off between write and wait.
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, percpu_counter_add is a wrapper around __percpu_counter_add
which is preempt safe due to explicit calls to preempt_disable. Given
how __ prefix is used in percpu related interfaces, the naming
unfortunately creates the false sense that __percpu_counter_add is
less safe than percpu_counter_add. In terms of context-safety,
they're equivalent. The only difference is that the __ version takes
a batch parameter.
Make this a bit more explicit by just renaming __percpu_counter_add to
percpu_counter_add_batch.
This patch doesn't cause any functional changes.
tj: Minor updates to patch description for clarity. Cosmetic
indentation updates.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: "David S. Miller" <davem@davemloft.net>
We can keep the state among the other fs_info flags, there's no reason
why fs_frozen would need to be separate.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Submit and wait parts of write_dev_flush() can be split into two
separate functions for better readability.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no extra benefit to count null bdev during the submit loop,
as these null devices will be anyway checked during command
completion device loop just after the submit loop. We are holding the
device_list_mutex, the device->bdev status won't change in between.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit "btrfs: btrfs_io_bio_alloc never fails, skip error handling"
write_dev_flush will not return ENOMEM in the sending part. We do not
need to check for it in the callers.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more imporatant,
contexts and the bio allocating API now looks more consistent.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update direct callers of btrfs_io_bio_alloc that do error handling, that
we can now remove.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Observing the number of slab objects of btrfs_transaction, there's just
one active on an almost quiescent filesystem, and the number of objects
goes to about ten when sync is in progress. Then the nubmer goes down to
1. This matches the expectations of the transaction lifetime.
For such use the separate slab cache is not justified, as we do not
reuse objects frequently. For the shortlived transaction, the generic
slab (size 512) should be ok. We can optimistically expect that the 512
slabs are not all used (fragmentation) and there are free slots to take
when we do the allocation, compared to potentially allocating a whole new
page for the separate slab.
We'll lose the stats about the object use, which could be added later if
we really need them.
Signed-off-by: David Sterba <dsterba@suse.com>
Nothing checks its return value.
Is it safe to skip checking return value of btrfs_wait_tree_block_writeback?
Liu Bo: I think yes, it's used in walk_log_tree which is called in two
places, free_log_tree and log replay. For free_log_tree, it waits for
any running writeback of the extent buffer under freeing to finish in
case we need to access the eb pointer from page->private, and it's OK to
not check the return value, while for log replay, it's doesn't wait
because wc->wait is not set. So neither cares about the writeback error.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
[ added more explanation to changelog, from Liu Bo ]
Signed-off-by: David Sterba <dsterba@suse.com>
The end io work queue items have been tracked by the work queues since
"Btrfs: Add async worker threads for pre and post IO checksumming"
(8b71284292) (2008).
Signed-off-by: David Sterba <dsterba@suse.com>
The list used to track checksums in the early version (2.6.29), but I
was able not pinpoint the commit that stopped using it. Everything
apparently works without it for a long time.
Signed-off-by: David Sterba <dsterba@suse.com>
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks. This works fine when everything that has
an extent_io_tree has an inode. But we are going to remove the
btree_inode, so we need to change this. Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead. This had a lot of cascading changes but
should be relatively straightforward.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>
The ->free_chunk_space variable is used to track the unallocated space
and access to it is protected by a spinlock, which is not used for
anything else. Make the code a bit self-explanatory by switching the
variable to an atomic64_t type and kill the spinlock.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ not a performance critical code, use of atomic type is ok ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds comments to the flush error handling part of the code, and
hopes to maintain the same logic with a framework which can be used to
handle the errors at the volume level.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
=m2Gx
-----END PGP SIGNATURE-----
Merge tag 'v4.12-rc5' into for-4.13/block
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.
Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull btrfs fixes from Chris Mason:
"Some fixes that Dave Sterba collected.
We've been hitting an early enospc problem on production machines that
Omar tracked down to an old int->u64 mistake. I waited a bit on this
pull to make sure it was really the problem from production, but it's
on ~2100 hosts now and I think we're good.
Omar also noticed a commit in the queue would make new early ENOSPC
problems. I pulled that out for now, which is why the top three
commits are younger than the rest.
Otherwise these are all fixes, some explaining very old bugs that
we've been poking at for a while"
* 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix delalloc accounting leak caused by u32 overflow
Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io
btrfs: tree-log.c: Wrong printk information about namelen
btrfs: fix race with relocation recovery and fs_root setup
btrfs: fix memory leak in update_space_info failure path
btrfs: use correct types for page indices in btrfs_page_exists_in_range
btrfs: fix incorrect error return ret being passed to mapping_set_error
btrfs: Make flush bios explicitely sync
btrfs: fiemap: Cache and merge fiemap extent before submit it to user
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit b685d3d65a "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions. generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions
Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Fixes: b685d3d65a
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from Chris Mason:
"This has fixes and cleanups Dave Sterba collected for the merge
window.
The biggest functional fixes are between btrfs raid5/6 and scrub, and
raid5/6 and device replacement. Some of our pending qgroup fixes are
included as well while I bash on the rest in testing.
We also have the usual set of cleanups, including one that makes
__btrfs_map_block() much more maintainable, and conversions from
atomic_t to refcount_t"
* 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (71 commits)
btrfs: fix the gfp_mask for the reada_zones radix tree
Btrfs: fix reported number of inode blocks
Btrfs: send, fix file hole not being preserved due to inline extent
Btrfs: fix extent map leak during fallocate error path
Btrfs: fix incorrect space accounting after failure to insert inline extent
Btrfs: fix invalid attempt to free reserved space on failure to cow range
btrfs: Handle delalloc error correctly to avoid ordered extent hang
btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
btrfs: check if the device is flush capable
btrfs: delete unused member nobarriers
btrfs: scrub: Fix RAID56 recovery race condition
btrfs: scrub: Introduce full stripe lock for RAID56
btrfs: Use ktime_get_real_ts for root ctime
Btrfs: handle only applicable errors returned by btrfs_get_extent
btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
btrfs: use q which is already obtained from bdev_get_queue
Btrfs: switch to div64_u64 if with a u64 divisor
Btrfs: update scrub_parity to use u64 stripe_len
Btrfs: enable repair during read for raid56 profile
btrfs: use clear_page where appropriate
...
Commits cc8385b59e and 7ef70b4d99 added preallocation for the
reada radix trees and also switched them over to GFP_KERNEL for the
default gfp mask.
Since we're doing radix tree insertions under spinlocks, we need
to make sure the mask doesn't allow sleeping. This fix keeps
the radix preallocation but switches back to the original gfp_mask.
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull block layer updates from Jens Axboe:
- Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
was initially a fork of CFQ, but subsequently changed to implement
fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
to be used on desktop type single drives, providing good fairness.
From Paolo.
- Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
using a scalable token based algorithm that throttles IO based on
live completion IO stats, similary to blk-wbt. From Omar.
- A series from Jan, moving users to separately allocated backing
devices. This continues the work of separating backing device life
times, solving various problems with hot removal.
- A series of updates for lightnvm, mostly from Javier. Includes a
'pblk' target that exposes an open channel SSD as a physical block
device.
- A series of fixes and improvements for nbd from Josef.
- A series from Omar, removing queue sharing between devices on mostly
legacy drivers. This helps us clean up other bits, if we know that a
queue only has a single device backing. This has been overdue for
more than a decade.
- Fixes for the blk-stats, and improvements to unify the stats and user
windows. This both improves blk-wbt, and enables other users to
register a need to receive IO stats for a device. From Omar.
- blk-throttle improvements from Shaohua. This provides a scalable
framework for implementing scalable priotization - particularly for
blk-mq, but applicable to any type of block device. The interface is
marked experimental for now.
- Bucketized IO stats for IO polling from Stephen Bates. This improves
efficiency of polled workloads in the presence of mixed block size
IO.
- A few fixes for opal, from Scott.
- A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
From a variety of folks, mostly Sagi and James Smart.
- A series from Bart, improving our exposed info and capabilities from
the blk-mq debugfs support.
- A series from Christoph, cleaning up how handle WRITE_ZEROES.
- A series from Christoph, cleaning up the block layer handling of how
we track errors in a request. On top of being a nice cleanup, it also
shrinks the size of struct request a bit.
- Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
never used by platforms, and the latter has outlived it's usefulness.
- Various little bug fixes and cleanups from a wide variety of folks.
* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
block: hide badblocks attribute by default
blk-mq: unify hctx delay_work and run_work
block: add kblock_mod_delayed_work_on()
blk-mq: unify hctx delayed_run_work and run_work
nbd: fix use after free on module unload
MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
blk-mq-sched: alloate reserved tags out of normal pool
mtip32xx: use runtime tag to initialize command header
scsi: Implement blk_mq_ops.show_rq()
blk-mq: Add blk_mq_ops.show_rq()
blk-mq: Show operation, cmd_flags and rq_flags names
blk-mq: Make blk_flags_show() callers append a newline character
blk-mq: Move the "state" debugfs attribute one level down
blk-mq: Unregister debugfs attributes earlier
blk-mq: Only unregister hctxs for which registration succeeded
blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
blk-mq: Let blk_mq_debugfs_register() look up the queue name
blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
ide-pm: always pass 0 error to __blk_end_request_all
..
Allocate struct backing_dev_info separately instead of embedding it
inside superblock. This unifies handling of bdi among users.
CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: David Sterba <dsterba@suse.com>
CC: linux-btrfs@vger.kernel.org
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
The block layer call chain from submit_bio will check if the write cache
is enabled for the given queue before submitting the flush. This will
add a code to fail fast if its not.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated changelog to reflect current code stat, blkdev_issue_flush is
not used yet ]
Signed-off-by: David Sterba <dsterba@suse.com>
The last consumer of nobarriers is removed by the commit [1] and sync
won't fail with EOPNOTSUPP anymore. Thus, now when write cache is write
through it just return success without actually transpiring such a
request to the block device/lun.
[1]
commit b25de9d6da
block: remove BIO_EOPNOTSUPP
And, as the device/lun write cache state may change dynamically saving
such as state won't help either. So deleting the member nobarriers.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can preallocate the node so insertion does not have to do that under
the lock. The GFP flags for the global radix tree are initialized to
GFP_NOFS & ~__GFP_DIRECT_RECLAIM
but we can use GFP_KERNEL, because readahead is optional and not on any
critical writeout path.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from Chris Mason:
"We have three small fixes queued up in my for-linus-4.11 branch"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix an integer overflow check
btrfs: Change qgroup_meta_rsv to 64bit
Btrfs: bring back repair during read
Using an int value is causing qg->reserved to become negative and
exclusive -EDQUOT to be reached prematurely.
This affects exclusive qgroups only.
TEST CASE:
DEVICE=/dev/vdb
MOUNTPOINT=/mnt
SUBVOL=$MOUNTPOINT/tmp
umount $SUBVOL
umount $MOUNTPOINT
mkfs.btrfs -f $DEVICE
mount /dev/vdb $MOUNTPOINT
btrfs quota enable $MOUNTPOINT
btrfs subvol create $SUBVOL
umount $MOUNTPOINT
mount /dev/vdb $MOUNTPOINT
mount -o subvol=tmp $DEVICE $SUBVOL
btrfs qgroup limit -e 3G $SUBVOL
btrfs quota rescan /mnt -w
for i in `seq 1 44000`; do
dd if=/dev/zero of=/mnt/tmp/test_$i bs=10k count=1
if [[ $? > 0 ]]; then
btrfs qgroup show -pcref $SUBVOL
exit 1
fi
done
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
[ add reproducer to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Pull more btrfs updates from Chris Mason:
"Btrfs round two.
These are mostly a continuation of Dave Sterba's collection of
cleanups, but Filipe also has some bug fixes and performance
improvements"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (69 commits)
btrfs: add dummy callback for readpage_io_failed and drop checks
btrfs: drop checks for mandatory extent_io_ops callbacks
btrfs: document existence of extent_io ops callbacks
btrfs: let writepage_end_io_hook return void
btrfs: do proper error handling in btrfs_insert_xattr_item
btrfs: handle allocation error in update_dev_stat_item
btrfs: remove BUG_ON from __tree_mod_log_insert
btrfs: derive maximum output size in the compression implementation
btrfs: use predefined limits for calculating maximum number of pages for compression
btrfs: export compression buffer limits in a header
btrfs: merge nr_pages input and output parameter in compress_pages
btrfs: merge length input and output parameter in compress_pages
btrfs: constify name of subvolume in creation helpers
btrfs: constify buffers used by compression helpers
btrfs: constify input buffer of btrfs_csum_data
btrfs: constify device path passed to relevant helpers
btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode
btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode
btrfs: Make btrfs_add_nondir take btrfs_inode
btrfs: Make btrfs_add_link take btrfs_inode
...
Make extent_io_ops::readpage_io_failed_hook callback mandatory and
define a dummy function for btrfs_extent_io_ops. As the failed IO
callback is not performance critical, the branch vs extra trade off does
not hurt.
Signed-off-by: David Sterba <dsterba@suse.com>
Some of the callbacks defined in btree_extent_io_ops and
btrfs_extent_io_ops do always exist so we don't need to check the
existence before each call. This patch just reorders the definition and
documents which are mandatory/optional.
Signed-off-by: David Sterba <dsterba@suse.com>
In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from Chris Mason:
"This has a series of fixes and cleanups that Dave Sterba has been
collecting.
There is a pretty big variety here, cleaning up internal APIs and
fixing corner cases"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (124 commits)
Btrfs: use the correct type when creating cow dio extent
Btrfs: fix deadlock between dedup on same file and starting writeback
btrfs: use btrfs_debug instead of pr_debug in transaction abort
btrfs: btrfs_truncate_free_space_cache always allocates path
btrfs: free-space-cache, clean up unnecessary root arguments
btrfs: convert btrfs_inc_block_group_ro to accept fs_info
btrfs: flush_space always takes fs_info->fs_root
btrfs: pass fs_info to (more) routines that are only called with extent_root
btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
btrfs: remove unused parameter from adjust_slots_upwards
btrfs: remove unused parameters from __btrfs_write_out_cache
btrfs: remove unused parameter from cleanup_write_cache_enospc
btrfs: remove unused parameter from __add_inode_ref
btrfs: remove unused parameter from clone_copy_inline_extent
btrfs: remove unused parameters from btrfs_cmp_data
btrfs: remove unused parameter from __add_inline_refs
btrfs: remove unused parameters from scrub_setup_wr_ctx
btrfs: remove unused parameter from create_snapshot
btrfs: remove unused parameter from init_first_rw_device
btrfs: remove unused parameter from __btrfs_alloc_chunk
...
None of the checks need to know the ro/rw status as they're all not
changing the superblock. Moreover, we can access the sb flags directly
if we'd need to decide by the ro/rw status.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
write_all_supers and write_ctree_super are almost equal, the parameter
'trans' is unused so we can drop it and have just one helper.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
blk_get_backing_dev_info() is now a simple dereference. Remove that
function and simplify some code around that.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull btrfs updates from Chris Mason:
"Jeff Mahoney and Dave Sterba have a really nice set of cleanups in
here, and Christoph pitched in corrections/improvements to make btrfs
use proper helpers for bio walking instead of doing it by hand.
There are some key fixes as well, including some long standing bugs
that took forever to track down in btrfs_drop_extents and during
balance"
* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits)
btrfs: limit async_work allocation and worker func duration
Revert "Btrfs: adjust len of writes if following a preallocated extent"
Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors
btrfs: opencode chunk locking, remove helpers
btrfs: remove root parameter from transaction commit/end routines
btrfs: split btrfs_wait_marked_extents into normal and tree log functions
btrfs: take an fs_info directly when the root is not used otherwise
btrfs: simplify btrfs_wait_cache_io prototype
btrfs: convert extent-tree tracepoints to use fs_info
btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
btrfs: root->fs_info cleanup, add fs_info convenience variables
btrfs: root->fs_info cleanup, update_block_group{,flags}
btrfs: root->fs_info cleanup, lock/unlock_chunks
btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
btrfs: pull node/sector/stripe sizes out of root and into fs_info
btrfs: root->fs_info cleanup, io_ctl_init
btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
btrfs: struct reada_control.root -> reada_control.fs_info
btrfs: struct btrfsic_state->root should be an fs_info
btrfs: alloc_reserved_file_extent trace point should use extent_root
...
Patches queued up by Filipe:
The most important change is still the fix for the extent tree
corruption that happens due to balance when qgroups are enabled (a
regression introduced in 4.7 by a fix for a regression from the last
qgroups rework). This has been hitting SLE and openSUSE users and QA
very badly, where transactions keep getting aborted when running
delayed references leaving the root filesystem in RO mode and nearly
unusable. There are fixes here that allow us to run xfstests again
with the integrity checker enabled, which has been impossible since 4.8
(apparently I'm the only one running xfstests with the integrity
checker enabled, which is useful to validate dirtied leafs, like
checking if there are keys out of order, etc). The rest are just some
trivial fixes, most of them tagged for stable, and two cleanups.
Signed-off-by: Chris Mason <clm@fb.com>
Now we only use the root parameter to print the root objectid in
a tracepoint. We can use the root parameter from the transaction
handle for that. It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This results in btrfs_assert_delayed_root_empty and
btrfs_destroy_delayed_inode taking an fs_info instead of a root.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable. This makes the code considerably
more readable.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We track the node sizes per-root, but they never vary from the values
in the superblock. This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are many functions that are always called with the same root
argument. Rather than passing the same root every time, we can
pass an fs_info pointer instead and have the function get the root
pointer itself.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are 11 functions that accept a root parameter and immediately
overwrite it. We can pass those an fs_info pointer instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This issue was found when I tried to delete a heavily reflinked file,
when deleting such files, other transaction operation will not have a
chance to make progress, for example, start_transaction() will blocked
in wait_current_trans(root) for long time, sometimes it even triggers
soft lockups, and the time taken to delete such heavily reflinked file
is also very large, often hundreds of seconds. Using perf top, it reports
that:
PerfTop: 7416 irqs/sec kernel:99.8% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs)
---------------------------------------------------------------------------------------
84.37% [btrfs] [k] __btrfs_run_delayed_refs.constprop.80
11.02% [kernel] [k] delay_tsc
0.79% [kernel] [k] _raw_spin_unlock_irq
0.78% [kernel] [k] _raw_spin_unlock_irqrestore
0.45% [kernel] [k] do_raw_spin_lock
0.18% [kernel] [k] __slab_alloc
It seems __btrfs_run_delayed_refs() took most cpu time, after some debug
work, I found it's select_delayed_ref() causing this issue, for a delayed
head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but
select_delayed_ref() will firstly try to iterate node list to find
BTRFS_ADD_DELAYED_REF nodes, obviously it's a disaster in this case, and
waste much time.
To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head,
then in select_delayed_ref(), if this list is not empty, we can directly use
nodes in this list. With this patch, it just took about 10~15 seconds to
delte the same file. Now using perf top, it reports that:
PerfTop: 2734 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs)
----------------------------------------------------------------------------------------
20.74% [kernel] [k] _raw_spin_unlock_irqrestore
16.33% [kernel] [k] __slab_alloc
5.41% [kernel] [k] lock_acquired
4.42% [kernel] [k] lock_acquire
4.05% [kernel] [k] lock_release
3.37% [kernel] [k] _raw_spin_unlock_irq
For normal files, this patch also gives help, at least we do not need to
iterate whole list to found BTRFS_ADD_DELAYED_REF nodes.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David Sterba <dsterba@suse.com>
csum member of struct btrfs_super_block has array type of u8. It makes
sense that function btrfs_csum_final should be also declared to accept
u8 *. I changed the declaration of method void btrfs_csum_final(u32 crc,
char *result); to void btrfs_csum_final(u32 crc, u8 *result);
Signed-off-by: Domagoj Tršan <domagoj.trsan@gmail.com>
[ changed cast to u8 at several call sites ]
Signed-off-by: David Sterba <dsterba@suse.com>
The only memset we do is to 0, so sink the parameter to the function and
simplify all calls. Rename the function to reflect the behaviour.
Signed-off-by: David Sterba <dsterba@suse.com>
During the time, the function has been shrunk to the point that it just
calls find_extent_buffer, just passing the parameters.
Signed-off-by: David Sterba <dsterba@suse.com>
Originally, the eb and start were passed separately in case eb is NULL.
Since the readahead has been refactored in 4.6, this is not true anymore
and we can get rid of the parameter.
Signed-off-by: David Sterba <dsterba@suse.com>
This can only happen with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y.
Commit 1ba98d0 ("Btrfs: detect corruption when non-root leaf has zero item")
assumes that a leaf is its root when leaf->bytenr == btrfs_root_bytenr(root),
however, we should not use btrfs_root_bytenr(root) since it's mainly got
updated during committing transaction. So the check can fail when doing
COW on this leaf while it is a root.
This changes to use "if (leaf == btrfs_root_node(root))" instead, just like
how we check whether leaf is a root in __btrfs_cow_block().
Fixes: 1ba98d086f (Btrfs: detect corruption when non-root leaf has zero item)
Cc: stable@vger.kernel.org # 4.8+
Reported-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
There are two separate issues that can lead to corrupted free space
trees.
1. The free space tree bitmaps had an endianness issue on big-endian
systems which is fixed by an earlier patch in this series.
2. btrfs-progs before v4.7.3 modified filesystems without updating the
free space tree.
To catch both of these issues at once, we need to force the free space
tree to be rebuilt. To do so, add a FREE_SPACE_TREE_VALID compat_ro bit.
If the bit isn't set, we know that it was either produced by a broken
big-endian kernel or may have been corrupted by btrfs-progs.
This also provides us with a way to add rudimentary read-write support
for the free space tree to btrfs-progs: it can just clear this bit and
have the kernel rebuild the free space tree.
Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We moved the code for creating the free space tree the first time that
it's enabled, but didn't move the clearing code along with it. This
breaks my (undocumented) intention that `mount -o
clear_cache,space_cache=v2` would clear the free space tree and then
recreate it.
Fixes: 511711af91 ("btrfs: don't run delayed references while we are creating the free space tree")
Cc: stable@vger.kernel.org # 4.5+
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For many printks, we want to know which file system issued the message.
This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.
fs/btrfs/check-integrity.c is left alone for another day.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch converts printk(KERN_* style messages to use the pr_* versions.
One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."
This patch unsplits user-visible strings.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to check items in a node to make sure that we're reading
a valid one, otherwise we could get various crashes while processing
delayed_refs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nobody uses this, it makes no sense to do partial reads of extent buffers.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a lot of random ints in btrfs_fs_info that can be put into flags. This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While processing delayed refs, we may update block group's statistics
and attach it to cur_trans->dirty_bgs, and later writing dirty block
groups will process the list, which happens during
btrfs_commit_transaction().
For whatever reason, the transaction is aborted and dirty_bgs
is not processed in cleanup_transaction(), we end up with memory leak
of these dirty block group cache.
Since btrfs_start_dirty_block_groups() doesn't make it go to the commit
critical section, this also adds the cleanup work inside it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from Chris Mason:
"We've queued up a few different fixes in here. These range from
enospc corners to fsync and quota fixes, and a few targeted at error
handling for corrupt metadata/fuzzing"
* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix lockdep warning on deadlock against an inode's log mutex
Btrfs: detect corruption when non-root leaf has zero item
Btrfs: check btree node's nritems
btrfs: don't create or leak aliased root while cleaning up orphans
Btrfs: fix em leak in find_first_block_group
btrfs: do not background blkdev_put()
Btrfs: clarify do_chunk_alloc()'s return value
btrfs: fix fsfreeze hang caused by delayed iputs deal
btrfs: update btrfs_space_info's bytes_may_use timely
btrfs: divide btrfs_update_reserved_bytes() into two functions
btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
btrfs: qgroup: Fix qgroup incorrectness caused by log replay
btrfs: relocation: Fix leaking qgroups numbers on data extents
btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
btrfs: waiting on qgroup rescan should not always be interruptible
btrfs: properly track when rescan worker is running
btrfs: flush_space: treat return value of do_chunk_alloc properly
Btrfs: add ASSERT for block group's memory leak
btrfs: backref: Fix soft lockup in __merge_refs function
Btrfs: fix memory leak of reloc_root
Right now we treat leaf which has zero item as a valid one
because we could have an empty tree, that is, a root that is
also a leaf without any item, however, in the same case but
when the leaf is not a root, we can end up with hitting the
BUG_ON(1) in btrfs_extend_item() called by
setup_inline_extent_backref().
This makes us check the situation as a corruption if leaf is
not its own root.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When btree node (level = 1) has nritems which equals to zero,
we can end up with panic due to insert_ptr()'s
BUG_ON(slot > nritems);
where slot is 1 and nritems is 0, as copy_for_split() calls
insert_ptr(.., path->slots[1] + 1, ...);
A invalid value results in the whole mess, this adds the check
for btree's node nritems so that we stop reading block when
when something is wrong.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
commit 909c3a22da (Btrfs: fix loading of orphan roots leading to BUG_ON)
avoids the BUG_ON but can add an aliased root to the dead_roots list or
leak the root.
Since we've already been loading roots into the radix tree, we should
use it before looking the root up on disk.
Cc: <stable@vger.kernel.org> # 4.5
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When running fstests generic/068, sometimes we got below deadlock:
xfs_io D ffff8800331dbb20 0 6697 6693 0x00000080
ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000
ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001
ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8
Call Trace:
[<ffffffff816a9045>] schedule+0x35/0x80
[<ffffffff816abab2>] rwsem_down_read_failed+0xf2/0x140
[<ffffffff8118f5e1>] ? __filemap_fdatawrite_range+0xd1/0x100
[<ffffffff8134f978>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffffa06631fc>] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
[<ffffffff810d32b5>] percpu_down_read+0x35/0x50
[<ffffffff81217dfc>] __sb_start_write+0x2c/0x40
[<ffffffffa067f5d5>] start_transaction+0x2a5/0x4d0 [btrfs]
[<ffffffffa067f857>] btrfs_join_transaction+0x17/0x20 [btrfs]
[<ffffffffa068ba34>] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
[<ffffffff81230a1a>] evict+0xba/0x1a0
[<ffffffff812316b6>] iput+0x196/0x200
[<ffffffffa06851d0>] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
[<ffffffffa067f1d8>] btrfs_commit_transaction+0x928/0xa80 [btrfs]
[<ffffffffa0646df0>] btrfs_freeze+0x30/0x40 [btrfs]
[<ffffffff81218040>] freeze_super+0xf0/0x190
[<ffffffff81229275>] do_vfs_ioctl+0x4a5/0x5c0
[<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70
[<ffffffff810038cf>] ? syscall_trace_enter_phase1+0x11f/0x140
[<ffffffff81229409>] SyS_ioctl+0x79/0x90
[<ffffffff81003c12>] do_syscall_64+0x62/0x110
[<ffffffff816acbe1>] entry_SYSCALL64_slow_path+0x25/0x25
>From this warning, freeze_super() already holds SB_FREEZE_FS, but
btrfs_freeze() will call btrfs_commit_transaction() again, if
btrfs_commit_transaction() finds that it has delayed iputs to handle,
it'll start_transaction(), which will try to get SB_FREEZE_FS lock
again, then deadlock occurs.
The root cause is that in btrfs, sync_filesystem(sb) does not make
sure all metadata is updated. There still maybe some codes adding
delayed iputs, see below sample race window:
CPU1 | CPU2
|-> freeze_super() |
|-> sync_filesystem(sb); |
| |-> cleaner_kthread()
| | |-> btrfs_delete_unused_bgs()
| | |-> btrfs_remove_chunk()
| | |-> btrfs_remove_block_group()
| | |-> btrfs_add_delayed_iput()
| |
|-> sb->s_writers.frozen = SB_FREEZE_FS; |
|-> sb_wait_write(sb, SB_FREEZE_FS); |
| acquire SB_FREEZE_FS lock. |
| |
|-> btrfs_freeze() |
|-> btrfs_commit_transaction() |
|-> btrfs_run_delayed_iputs() |
| will handle delayed iputs, |
| that means start_transaction() |
| will be called, which will try |
| to get SB_FREEZE_FS lock. |
To fix this issue, introduce a "int fs_frozen" to record internally whether
fs has been frozen. If fs has been frozen, we can not handle delayed iputs.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment to btrfs_freeze ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
We wait on qgroup rescan completion in three places: file system
shutdown, the quota disable ioctl, and the rescan wait ioctl. If the
user sends a signal while we're waiting, we continue happily along. This
is expected behavior for the rescan wait ioctl. It's racy in the shutdown
path but mostly works due to other unrelated synchronization points.
In the quota disable path, it Oopses the kernel pretty much immediately.
Cc: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
The qgroup_flags field is overloaded such that it reflects the on-disk
status of qgroups and the runtime state. The BTRFS_QGROUP_STATUS_FLAG_RESCAN
flag is used to indicate that a rescan operation is in progress, but if
the file system is unmounted while a rescan is running, the rescan
operation is paused. If the file system is then mounted read-only,
the flag will still be present but the rescan operation will not have
been resumed. When we go to umount, btrfs_qgroup_wait_for_completion
will see the flag and interpret it to mean that the rescan worker is
still running and will wait for a completion that will never come.
This patch uses a separate flag to indicate when the worker is
running. The locking and state surrounding the qgroup rescan worker
needs a lot of attention beyond this patch but this is enough to
avoid a hung umount.
Cc: <stable@vger.kernel.org> # v4.4+
Signed-off-by; Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When some critical errors occur and FS would be flipped into RO,
if we have an on-going balance, we can end up with a memory leak
of root->reloc_root since btrfs_drop_snapshots() bails out
without freeing reloc_root at the very early start.
However, we're not able to free reloc_root in btrfs_drop_snapshots()
because its caller, merge_reloc_roots(), still needs to access it to
cleanup reloc_root's rbtree.
This makes us free reloc_root when we're going to free fs/file roots.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull more btrfs updates from Chris Mason:
"This is part two of my btrfs pull, which is some cleanups and a batch
of fixes.
Most of the code here is from Jeff Mahoney, making the pointers we
pass around internally more consistent and less confusing overall. I
noticed a small problem right before I sent this out yesterday, so I
fixed it up and re-tested overnight"
* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits)
Btrfs: fix __MAX_CSUM_ITEMS
btrfs: btrfs_abort_transaction, drop root parameter
btrfs: add btrfs_trans_handle->fs_info pointer
btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
btrfs: convert nodesize macros to static inlines
btrfs: introduce BTRFS_MAX_ITEM_SIZE
btrfs: cleanup, remove prototype for btrfs_find_root_ref
btrfs: copy_to_sk drop unused root parameter
btrfs: simpilify btrfs_subvol_inherit_props
btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
btrfs: tests, require fs_info for root
btrfs: tests, move initialization into tests/
btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
btrfs: prefix fsid to all trace events
btrfs: plumb fs_info into btrfs_work
btrfs: remove obsolete part of comment in statfs
btrfs: hide test-only member under ifdef
btrfs: Ratelimit "no csum found" info message
btrfs: Add ratelimit to btrfs printing
Btrfs: fix unexpected balance crash due to BUG_ON
...
Pull core block updates from Jens Axboe:
- the big change is the cleanup from Mike Christie, cleaning up our
uses of command types and modified flags. This is what will throw
some merge conflicts
- regression fix for the above for btrfs, from Vincent
- following up to the above, better packing of struct request from
Christoph
- a 2038 fix for blktrace from Arnd
- a few trivial/spelling fixes from Bart Van Assche
- a front merge check fix from Damien, which could cause issues on
SMR drives
- Atari partition fix from Gabriel
- convert cfq to highres timers, since jiffies isn't granular enough
for some devices these days. From Jan and Jeff
- CFQ priority boost fix idle classes, from me
- cleanup series from Ming, improving our bio/bvec iteration
- a direct issue fix for blk-mq from Omar
- fix for plug merging not involving the IO scheduler, like we do for
other types of merges. From Tahsin
- expose DAX type internally and through sysfs. From Toshi and Yigal
* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
block: Fix front merge check
block: do not merge requests without consulting with io scheduler
block: Fix spelling in a source code comment
block: expose QUEUE_FLAG_DAX in sysfs
block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Btrfs: fix comparison in __btrfs_map_block()
block: atari: Return early for unsupported sector size
Doc: block: Fix a typo in queue-sysfs.txt
cfq-iosched: Charge at least 1 jiffie instead of 1 ns
cfq-iosched: Fix regression in bonnie++ rewrite performance
cfq-iosched: Convert slice_resid from u64 to s64
block: Convert fifo_time from ulong to u64
blktrace: avoid using timespec
block/blk-cgroup.c: Declare local symbols static
block/bio-integrity.c: Add #include "blk.h"
block/partition-generic.c: Remove a set-but-not-used variable
block: bio: kill BIO_MAX_SIZE
cfq-iosched: temporarily boost queue priority for idle classes
block: drbd: avoid to use BIO_MAX_SIZE
block: bio: remove BIO_MAX_SECTORS
...
Now that we have a dummy fs_info associated with each test that
uses a root, we don't need the DUMMY_ROOT bit anymore. This lets
us make choices without needing an actual root like in e.g.
btrfs_find_create_tree_block.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This allows the upcoming patchset to push nodesize and sectorsize into
fs_info.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_test_opt and friends only use the root pointer to access
the fs_info. Let's pass the fs_info directly in preparation to
eliminate similar patterns all over btrfs.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to provide an fsid for trace events, we'll need a btrfs_fs_info
pointer. The most lightweight way to do that for btrfs_work structures
is to associate it with the __btrfs_workqueue structure. Each queued
btrfs_work structure has a workqueue associated with it, so that's
a natural fit. It's a privately defined structures, so we add accessors
to retrieve the fs_info pointer.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs to be shrunk. This
means those caches are not in fact reclaimable.
To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs code currently assumes stripesize to be same as
sectorsize. However Btrfs-progs (until commit
df05c7ed455f519e6e15e46196392e4757257305) has been setting
btrfs_super_block->stripesize to a value of 4096.
This commit makes sure that the value of btrfs_super_block->stripesize
is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
to sectorsize.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Older btrfs-progs/mkfs.btrfs sets 4096 as the stripesize. Hence
restricting stripesize to be equal to sectorsize would cause super block
validation to return an error on architectures where PAGE_SIZE is not
equal to 4096.
Hence as a workaround, this commit allows stripesize to be set to 4096
bytes.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This fixes a problem introduced in commit 2f3165ecf1
"btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes".
open_ctree eventually calls btrfs_replay_log which in turn calls
btrfs_commit_super which tries to lock the cleaner_mutex, causing a
recursive mutex deadlock during mount.
Instead of playing whack-a-mole trying to keep up with all the
functions that may want to lock cleaner_mutex, put all the cleaner_mutex
lockers back where they were, and attack the problem more directly:
keep cleaner_kthread asleep until the filesystem is mounted.
When filesystems are mounted read-only and later remounted read-write,
open_ctree did not set fs_info->open and neither does anything else.
Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs
nor cleaner_kthread get confused by the common case of "/" filesystem
read-only mount followed by read-write remount.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer(). An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.
This adds a warning to let us quickly know what was happening.
Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This should be the easier cases to convert btrfs to
bio_set_op_attrs/bio_op.
They are mostly just cut and replace type of changes.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has submit_bh users pass in the operation and flags separately,
so submit_bh_wbc can setup the bio op and bi_rw flags on the bio that
is submitted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds validation checks for super_total_bytes, super_bytes_used and
super_stripesize, super_num_devices.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
self-tests code assumes 4k as the sectorsize and nodesize. This commit
fix hardcoded 4K. Enables the self-tests code to be executed on non-4k
page sized systems (e.g. ppc64).
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we lack the identification of the filesystem in most if not
all mount messages, done via printk/pr_* functions. We can use the
btrfs_* helpers in open_ctree, as the fs_info <-> sb link is established
at the beginning of the function.
The messages have been updated at the same time to be more consistent:
* dropped sb->s_id, as it's not available via btrfs_*
* added %d for return code where appropriate
* wording changed
* %Lx replaced by %llx
Signed-off-by: David Sterba <dsterba@suse.com>
During a mount, we start the cleaner kthread first because the transaction
kthread wants to wake up the cleaner kthread. We start the transaction
kthread next because everything in btrfs wants transactions. We do reloc
recovery in the thread that was doing the original mount call once the
transaction kthread is running. This means that the cleaner kthread
could already be running when reloc recovery happens (e.g. if a snapshot
delete was started before a crash).
Relocation does not play well with the cleaner kthread, so a mutex was
added in commit 5f3164813b "Btrfs: fix
race between balance recovery and root deletion" to prevent both from
being active at the same time.
If the cleaner kthread is already holding the mutex by the time we get
to btrfs_recover_relocation, the mount will be blocked until at least
one deleted subvolume is cleaned (possibly more if the mount process
doesn't get the lock right away). During this time (which could be an
arbitrarily long time on a large/slow filesystem), the mount process is
stuck and the filesystem is unnecessarily inaccessible.
Fix this by locking cleaner_mutex before we start cleaner_kthread, and
unlocking the mutex after mount no longer requires it. This ensures
that the mounting process will not be blocked by the cleaner kthread.
The cleaner kthread is already prepared for mutex contention and will
just go to sleep until the mutex is available.
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_std_error() handles errors, puts FS into readonly mode
(as of now). So its good idea to rename it to btrfs_handle_fs_error().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ edit changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov:
"PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
Let's stop pretending that pages in page cache are special. They are
not.
The first patch with most changes has been done with coccinelle. The
second is manual fixups on top.
The third patch removes macros definition"
[ I was planning to apply this just before rc2, but then I spaced out,
so here it is right _after_ rc2 instead.
As Kirill suggested as a possibility, I could have decided to only
merge the first two patches, and leave the old interfaces for
compatibility, but I'd rather get it all done and any out-of-tree
modules and patches can trivially do the converstion while still also
working with older kernels, so there is little reason to try to
maintain the redundant legacy model. - Linus ]
* PAGE_CACHE_SIZE-removal:
mm: drop PAGE_CACHE_* and page_cache_{get,release} definition
mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull btrfs fixes from Chris Mason:
"This has a few fixes Dave Sterba had queued up. These are all pretty
small, but since they were tested I decided against waiting for more"
* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: transaction_kthread() is not freezable
btrfs: cleaner_kthread() doesn't need explicit freeze
btrfs: do not write corrupted metadata blocks to disk
btrfs: csum_tree_block: return proper errno value
transaction_kthread() is calling try_to_freeze(), but that's just an
expeinsive no-op given the fact that the thread is not marked freezable.
After removing this, disk-io.c is now independent on freezer API.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
cleaner_kthread() is not marked freezable, and therefore calling
try_to_freeze() in its context is a pointless no-op.
In addition to that, as has been clearly demonstrated by 80ad623edd
("Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"), it's perfectly
valid / legal for cleaner_kthread() to stay scheduled out in an arbitrary
place during suspend (in that particular example that was waiting for
reading of extent pages), so there is no need to leave any traces of
freezer in this kthread.
Fixes: 80ad623edd ("Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()")
Fixes: 6962491321 ("btrfs: clear PF_NOFREEZE in cleaner_kthread()")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return error in this case, and also add an additional sanity
check on the extent buffer header.
The caller up the chain may BUG_ON on this, for example flush_epd_write_bio will,
but it is better than to have a silent metadata corruption on disk.
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from Chris Mason:
"We have a good sized cleanup of our internal read ahead code, and the
first series of commits from Chandan to enable PAGE_SIZE > sectorsize
Otherwise, it's a normal series of cleanups and fixes, with many
thanks to Dave Sterba for doing most of the patch wrangling this time"
* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (82 commits)
btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums
btrfs: Fix misspellings in comments.
btrfs: Print Warning only if ENOSPC_DEBUG is enabled
btrfs: scrub: silence an uninitialized variable warning
btrfs: move btrfs_compression_type to compression.h
btrfs: rename btrfs_print_info to btrfs_print_mod_info
Btrfs: Show a warning message if one of objectid reaches its highest value
Documentation: btrfs: remove usage specific information
btrfs: use kbasename in btrfsic_mount
Btrfs: do not collect ordered extents when logging that inode exists
Btrfs: fix race when checking if we can skip fsync'ing an inode
Btrfs: fix listxattrs not listing all xattrs packed in the same item
Btrfs: fix deadlock between direct IO reads and buffered writes
Btrfs: fix extent_same allowing destination offset beyond i_size
Btrfs: fix file loss on log replay after renaming a file and fsync
Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
Btrfs: fix lockdep deadlock warning due to dev_replace
btrfs: drop unused argument in btrfs_ioctl_get_supported_features
btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls
btrfs: change max_inline default to 2048
...
So that its better organized.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Xfstests btrfs/011 complains about a deadlock warning,
[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1226.652955] Possible interrupt unsafe locking scenario:
[ 1226.652955] CPU0 CPU1
[ 1226.652955] ---- ----
[ 1226.652955] lock(&fs_info->dev_replace.lock);
[ 1226.652955] local_irq_disable();
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955] lock(&found->groups_sem);
[ 1226.652955] <Interrupt>
[ 1226.652955] lock(&delayed_node->mutex);
[ 1226.652955]
*** DEADLOCK ***
Commit 084b6e7c76 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar one that has the exactly same warning, but with that, we still
run to this.
The above lock chain comes from
btrfs_commit_transaction
->btrfs_run_delayed_items
...
->__btrfs_update_delayed_inode
...
->__btrfs_cow_block
...
->find_free_extent
->cache_block_group
->load_free_space_cache
->btrfs_readpages
->submit_one_bio
...
->__btrfs_map_block
->btrfs_dev_replace_lock
However, with high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd and then kswapd is intended to release memory occupied
by superblock, inodes and dentries, where we may call evict_inode, and it comes
to
[ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to a ABBA deadlock.
To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
things are simpler here since we only needs read's spinlock to blocking lock.
With this, btrfs/011 no more produces warnings in dmesg.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cleanup.
kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reada creates 2 works for each level of tree recursively.
In case of a tree having many levels, the number of created works
is 2^level_of_tree.
Actually we don't need so many works in parallel, this patch limits
max works to BTRFS_MAX_MIRRORS * 2.
The per-fs works_counter will be also used for btrfs_reada_wait() to
check is there are background workers.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
What __readahead_hook() need exactly is fs_info, no need to convert
fs_info to root in caller and convert back in __readahead_hook()
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new mount option "nologreplay" to co-operate with "ro" mount
option to get real readonly mount, like "norecovery" in ext* and xfs.
Since the new parse_options() need to check new flags at remount time,
so add a new parameter for parse_options().
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current "recovery" mount option will only try to use backup root.
However the word "recovery" is too generic and may be confusing for some
users.
Here introduce a new and more specific mount option, "usebackuproot" to
replace "recovery" mount option.
"Recovery" will be kept for compatibility reason, but will be
deprecated.
Also, since "usebackuproot" will only affect mount behavior and after
open_ctree() it has nothing to do with the filesystem, so clear the flag
after mount succeeded.
This provides the basis for later unified "norecovery" mount option.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[ dropped usebackuproot from show_mount, added note about 'recovery' to
docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to use GFP_NOFS in all contexts, eg. during mount or for
dummy root tree, but we might for the the log tree creation.
Signed-off-by: David Sterba <dsterba@suse.com>
So the old one didn't work properly before alternatives had run.
And it was supposed to provide an optimized JMP because the
assumption was that the offset it is jumping to is within a
signed byte and thus a two-byte JMP.
So I did an x86_64 allyesconfig build and dumped all possible
sites where static_cpu_has() was used. The optimization amounted
to all in all 12(!) places where static_cpu_has() had generated
a 2-byte JMP. Which has saved us a whopping 36 bytes!
This clearly is not worth the trouble so we can remove it. The
only place where the optimization might count - in __switch_to()
- we will handle differently. But that's not subject of this
patch.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1453842730-28463-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull btrfs fixes from Chris Mason:
"Dave had a small collection of fixes to the new free space tree code,
one of which was keeping our sysfs files more up to date with feature
bits as different things get enabled (lzo, raid5/6, etc).
I should have kept the sysfs stuff for rc3, since we always manage to
trip over something. This time it was GFP_KERNEL from somewhere that
is NOFS only. Instead of rebasing it out I've put a revert in, and
we'll fix it properly for rc3.
Otherwise, Filipe fixed a btrfs DIO race and Qu Wenruo fixed up a
use-after-free in our tracepoints that Dave Jones reported"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Revert "btrfs: synchronize incompat feature bits with sysfs files"
btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzalloc
btrfs: sysfs: check initialization state before updating features
Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"
btrfs: async-thread: Fix a use-after-free error for trace
Btrfs: fix race between fsync and lockless direct IO writes
btrfs: add free space tree to the cow-only list
btrfs: add free space tree to lockdep classes
btrfs: tweak free space tree bitmap allocation
btrfs: tests: switch to GFP_KERNEL
btrfs: synchronize incompat feature bits with sysfs files
btrfs: sysfs: introduce helper for syncing bits with sysfs files
btrfs: sysfs: add free-space-tree bit attribute
btrfs: sysfs: fix typo in compat_ro attribute definition
Pull more btrfs updates from Chris Mason:
"These are mostly fixes that we've been testing, but also we grabbed
and tested a few small cleanups that had been on the list for a while.
Zhao Lei's patchset also fixes some early ENOSPC buglets"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (21 commits)
btrfs: raid56: Use raid_write_end_io for scrub
btrfs: Remove unnecessary ClearPageUptodate for raid56
btrfs: use rbio->nr_pages to reduce calculation
btrfs: Use unified stripe_page's index calculation
btrfs: Fix calculation of rbio->dbitmap's size calculation
btrfs: Fix no_space in write and rm loop
btrfs: merge functions for wait snapshot creation
btrfs: delete unused argument in btrfs_copy_from_user
btrfs: Use direct way to determine raid56 write/recover mode
btrfs: Small cleanup for get index_srcdev loop
btrfs: Enhance chunk validation check
btrfs: Enhance super validation check
Btrfs: fix deadlock running delayed iputs at transaction commit time
Btrfs: fix typo in log message when starting a balance
btrfs: remove duplicate const specifier
btrfs: initialize the seq counter in struct btrfs_device
Btrfs: clean up an error code in btrfs_init_space_info()
btrfs: fix iterator with update error in backref.c
Btrfs: fix output of compression message in btrfs_parse_options()
Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots
...
Enhance btrfs_check_super_valid() function by the following points:
1) Restrict sector/node size check
Not the old max/min valid check, but also check if it's a power of 2.
So some bogus number like 12K node size won't pass now.
2) Super flag check
For now, there is still some inconsistency between kernel and
btrfs-progs super flags.
And considering btrfs-progs may add new flags for super block, this
check will only output warning.
3) Better root alignment check
Now root bytenr is checked against sector size.
4) Move some check into btrfs_check_super_valid().
Like node size vs leaf size check, and PAGESIZE vs sectorsize check.
And magic number check.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs updates from Chris Mason:
"This has our usual assortment of fixes and cleanups, but the biggest
change included is Omar Sandoval's free space tree. It's not the
default yet, mounting -o space_cache=v2 enables it and sets a readonly
compat bit. The tree can actually be deleted and regenerated if there
are any problems, but it has held up really well in testing so far.
For very large filesystems (30T+) our existing free space caching code
can end up taking a huge amount of time during commits. The new tree
based code is faster and less work overall to update as the commit
progresses.
Omar worked on this during the summer and we'll hammer on it in
production here at FB over the next few months"
* 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (73 commits)
Btrfs: fix fitrim discarding device area reserved for boot loader's use
Btrfs: Check metadata redundancy on balance
btrfs: statfs: report zero available if metadata are exhausted
btrfs: preallocate path for snapshot creation at ioctl time
btrfs: allocate root item at snapshot ioctl time
btrfs: do an allocation earlier during snapshot creation
btrfs: use smaller type for btrfs_path locks
btrfs: use smaller type for btrfs_path lowest_level
btrfs: use smaller type for btrfs_path reada
btrfs: cleanup, use enum values for btrfs_path reada
btrfs: constify static arrays
btrfs: constify remaining structs with function pointers
btrfs tests: replace whole ops structure for free space tests
btrfs: use list_for_each_entry* in backref.c
btrfs: use list_for_each_entry_safe in free-space-cache.c
btrfs: use list_for_each_entry* in check-integrity.c
Btrfs: use linux/sizes.h to represent constants
btrfs: cleanup, remove stray return statements
btrfs: zero out delayed node upon allocation
btrfs: pass proper enum type to start_transaction()
...
The following call trace is seen when btrfs/031 test is executed in a loop,
[ 158.661848] ------------[ cut here ]------------
[ 158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea()
[ 158.664102] BTRFS: Transaction aborted (error -2)
[ 158.664774] Modules linked in:
[ 158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711a #2
[ 158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 158.667392] ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930
[ 158.668515] ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000
[ 158.669647] ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980
[ 158.670769] Call Trace:
[ 158.671153] [<ffffffff81431fc8>] dump_stack+0x44/0x5c
[ 158.671884] [<ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0
[ 158.672769] [<ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50
[ 158.673620] [<ffffffff813bc98d>] create_subvol+0x3d1/0x6ea
[ 158.674440] [<ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520
[ 158.675376] [<ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50
[ 158.676235] [<ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180
[ 158.677268] [<ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70
[ 158.678183] [<ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90
[ 158.678975] [<ffffffff81144b8e>] ? vma_merge+0xee/0x300
[ 158.679751] [<ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170
[ 158.680599] [<ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70
[ 158.681686] [<ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0
[ 158.682581] [<ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490
[ 158.683399] [<ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60
[ 158.684297] [<ffffffff8117b9d4>] SyS_ioctl+0x74/0x80
[ 158.685051] [<ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 158.685958] ---[ end trace 4b63312de5a2cb76 ]---
[ 158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry
[ 158.709508] BTRFS info (device loop0): forced readonly
[ 158.737113] BTRFS info (device loop0): disk space caching is enabled
[ 158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed
[ 158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30
This occurs because,
Mount filesystem
Create subvol with ID 257
Unmount filesystem
Mount filesystem
Delete subvol with ID 257
btrfs_drop_snapshot()
Add root corresponding to subvol 257 into
btrfs_transaction->dropped_roots list
Create new subvol (i.e. create_subvol())
257 is returned as the next free objectid
btrfs_read_fs_root_no_name()
Finds the btrfs_root instance corresponding to the old subvol with ID 257
in btrfs_fs_info->fs_roots_radix.
Returns error since btrfs_root_item->refs has the value of 0.
To fix the issue the commit initializes tree root's and subvolume root's
highest_objectid when loading the roots from disk.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use many constants to represent size and offset value. And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'. However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.
So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Looks like oversight, call brelse() when checksum fails. Further down the
code, in the non error path, we do call brelse() and so we don't see
brelse() in the goto error paths.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a short term solution to make sure btrfs_run_delayed_refs()
doesn't change the extent tree while we are scanning it to create the
free space tree.
Longer term we need to synchronize scanning the block groups one by one,
similar to what happens during a balance.
Signed-off-by: Chris Mason <clm@fb.com>
Those are stupid and code should use static_cpu_has_safe() or
boot_cpu_has() instead. Kill the least used and unused ones.
The remaining ones need more careful inspection before a conversion can
happen. On the TODO.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1449481182-27541-4-git-send-email-bp@alien8.de
Cc: David Sterba <dsterba@suse.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now we can finally hook up everything so we can actually use free space
tree. The free space tree is enabled by passing the space_cache=v2 mount
option. On the first mount with the this option set, the free space tree
will be created and the FREE_SPACE_TREE read-only compat bit will be
set. Any time the filesystem is mounted from then on, we must use the
free space tree. The clear_cache option will also clear the free space
tree.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes and cleanups from Chris Mason:
"Some of this got cherry-picked from a github repo this week, but I
verified the patches.
We have three small scrub cleanups and a collection of fixes"
* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: Use fs_info directly in btrfs_delete_unused_bgs
btrfs: Fix lost-data-profile caused by balance bg
btrfs: Fix lost-data-profile caused by auto removing bg
btrfs: Remove len argument from scrub_find_csum
btrfs: Reduce unnecessary arguments in scrub_recheck_block
btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum
btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum
btrfs: scrub: setup all fields for sblock_to_check
btrfs: scrub: set error stats when tree block spanning stripes
Btrfs: fix race when listing an inode's xattrs
Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow
Btrfs: fix race leading to incorrect item deletion when dropping extents
Btrfs: fix sleeping inside atomic context in qgroup rescan worker
Btrfs: fix race waiting for qgroup rescan worker
btrfs: qgroup: exit the rescan worker during umount
Btrfs: fix extent accounting for partial direct IO writes
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I was hitting a consistent NULL pointer dereference during shutdown that
showed the trace running through end_workqueue_bio(). I traced it back to
the endio_meta_workers workqueue being poked after it had already been
destroyed.
Eventually I found that the root cause was a qgroup rescan that was still
in progress while we were stopping all the btrfs workers.
Currently we explicitly pause balance and scrub operations in
close_ctree(), but we do nothing to stop the qgroup rescan. We should
probably be doing the same for qgroup rescan, but that's a much larger
change. This small change is good enough to allow me to unmount without
crashing.
Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
cleaner_kthread() kthread calls try_to_freeze() at the beginning of every
cleanup attempt. This operation can't ever succeed though, as the kthread
hasn't marked itself as freezable.
Before (hopefully eventually) kthread freezing gets converted to fileystem
freezing, we'd rather mark cleaner_kthread() freezable (as my
understanding is that it can generate filesystem I/O during suspend).
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
We have a mechanism to make sure we don't lose updates for ordered extents that
were logged in the transaction that is currently running. We add the ordered
extent to a transaction list and then the transaction waits on all the ordered
extents in that list. However are substantially large file systems this list
can be extremely large, and can give us soft lockups, since the ordered extents
don't remove themselves from the list when they do complete.
To fix this we simply add a counter to the transaction that is incremented any
time we have a logged extent that needs to be completed in the current
transaction. Then when the ordered extent finally completes it decrements the
per transaction counter and wakes up the transaction if we are the last ones.
This will eliminate the softlockup. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs_raid_array[] is used to define all raid attributes, use it
to get tolerated_failures in btrfs_get_num_tolerated_disk_barrier_failures(),
instead of complex condition in function.
It can make code simple and auto-support other possible raid-type in
future.
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from Chris Mason:
"These are small and assorted. Neil's is the oldest, I dropped the
ball thinking he was going to send it in"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: support NFSv2 export
Btrfs: open_ctree: Fix possible memory leak
Btrfs: fix deadlock when finalizing block group creation
Btrfs: update fix for read corruption of compressed and shared extents
Btrfs: send, fix corner case for reference overwrite detection
Convert the simple cases, not all functions provide a way to reach the
fs_info. Also skipped debugging messages (print-tree, integrity
checker and pr_debug) and messages that are printed from possibly
unfinished mount.
Signed-off-by: David Sterba <dsterba@suse.com>
After reading one of chunk or tree root tree's root node from disk, if the
root node does not have EXTENT_BUFFER_UPTODATE flag set, we fail to release
the memory used by the root node. Fix this.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
This uses a chunk of code from btrfs_read_dev_super() and creates
a function called btrfs_read_dev_one_super() so that next patch
can use it for scratch superblock.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed bufhead to bh]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_error() and btrfs_std_error() does the same thing
and calls _btrfs_std_error(), so consolidate them together.
And the main motivation is that btrfs_error() is closely
named with btrfs_err(), one handles error action the other
is to log the error, so don't closely name them.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will return EIO when __bread() fails to read SB,
instead of EINVAL.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fixes from Chris Mason:
"This is an assorted set I've been queuing up:
Jeff Mahoney tracked down a tricky one where we ended up starting IO
on the wrong mapping for special files in btrfs_evict_inode. A few
people reported this one on the list.
Filipe found (and provided a test for) a difficult bug in reading
compressed extents, and Josef fixed up some quota record keeping with
snapshot deletion. Chandan killed off an accounting bug during DIO
that lead to WARN_ONs as we freed inodes"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: keep dropped roots in cache until transaction commit
Btrfs: Direct I/O: Fix space accounting
btrfs: skip waiting on ordered range for special files
Btrfs: fix read corruption of compressed and shared extents
Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock
Btrfs: don't initialize a space info as full to prevent ENOSPC
Pull btrfs cleanups and fixes from Chris Mason:
"These are small cleanups, and also some fixes for our async worker
thread initialization.
I was having some trouble testing these, but it ended up being a
combination of changing around my test servers and a shiny new
schedule while atomic from the new start/finish_plug in
writeback_sb_inodes().
That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't
changing a bunch of things in my test setup at once it would have been
really clear. Fix for writeback_sb_inodes() on the way as well"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
btrfs: async_thread: Fix workqueue 'max_active' value when initializing
btrfs: Add raid56 support for updating num_tolerated_disk_barrier_failures in btrfs_balance
btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
btrfs: Update out-of-date "skip parity stripe" comment
After commmit e44163e177 ("btrfs: explictly delete unused block groups
in close_ctree and ro-remount"), added in the 4.3 merge window, we have
calls to btrfs_delete_unused_bgs() while holding the cleaner_mutex.
This can cause a deadlock with a concurrent block group relocation (when
a filesystem balance or shrink operation is in progress for example)
because btrfs_delete_unused_bgs() locks delete_unused_bgs_mutex and the
relocation path locks first delete_unused_bgs_mutex and then it locks
cleaner_mutex, resulting in a classic ABBA deadlock:
CPU 0 CPU 1
lock fs_info->cleaner_mutex
__btrfs_balance() || btrfs_shrink_device()
lock fs_info->delete_unused_bgs_mutex
btrfs_relocate_chunk()
btrfs_relocate_block_group()
lock fs_info->cleaner_mutex
btrfs_delete_unused_bgs()
lock fs_info->delete_unused_bgs_mutex
Fix this by not taking the cleaner_mutex before calling
btrfs_delete_unused_bgs() because it's no longer needed after
commit 67c5e7d464 ("Btrfs: fix race between balance and unused block
group deletion"). The mutex fs_info->delete_unused_bgs_mutex, the
spinlock fs_info->unused_bgs_lock and a block group's spinlock are
enough to get correct serialization between tasks running relocation
and unused block group deletion (as well as between multiple tasks
concurrently calling btrfs_delete_unused_bgs()).
This issue was discussed (in the mailing list) during the review of
the patch titled "btrfs: explictly delete unused block groups in
close_ctree and ro-remount" and it was agreed that acquiring the
cleaner mutex had to be dropped after the patch titled
"Btrfs: fix race between balance and unused block group deletion"
got merged (both patches were submitted at about the same time, but
one landed in kernel 4.2 and the other in the 4.3 merge window).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Pull btrfs updates from Chris Mason:
"This has Jeff Mahoney's long standing trim patch that fixes corners
where trims were missing. Omar has some raid5/6 fixes, especially for
using scrub and device replace when devices are missing.
Zhao Lie continues cleaning and fixing things, this series fixes some
really hard to hit corners in xfstests. I had to pull it last merge
window due to some deadlocks, but those are now resolved.
I added support for Tejun's new blkio controllers. It seems to work
well for single devices, we'll expand to multi-device as well"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits)
btrfs: fix compile when block cgroups are not enabled
Btrfs: fix file read corruption after extent cloning and fsync
Btrfs: check if previous transaction aborted to avoid fs corruption
btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
btrfs: Prevent from early transaction abort
btrfs: Remove unused arguments in tree-log.c
btrfs: Remove useless condition in start_log_trans()
Btrfs: add support for blkio controllers
Btrfs: remove unused mutex from struct 'btrfs_fs_info'
Btrfs: fix parity scrub of RAID 5/6 with missing device
Btrfs: fix device replace of a missing RAID 5/6 device
Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
Btrfs: count devices correctly in readahead during RAID 5/6 replace
Btrfs: remove misleading handling of missing device scrub
btrfs: fix clone / extent-same deadlocks
Btrfs: fix defrag to merge tail file extent
Btrfs: fix warning in backref walking
btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
btrfs: Remove root argument in extent_data_ref_count()
btrfs: Fix wrong comment of btrfs_alloc_tree_block()
...
Pull core block updates from Jens Axboe:
"This first core part of the block IO changes contains:
- Cleanup of the bio IO error signaling from Christoph. We used to
rely on the uptodate bit and passing around of an error, now we
store the error in the bio itself.
- Improvement of the above from myself, by shrinking the bio size
down again to fit in two cachelines on x86-64.
- Revert of the max_hw_sectors cap removal from a revision again,
from Jeff Moyer. This caused performance regressions in various
tests. Reinstate the limit, bump it to a more reasonable size
instead.
- Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
Most devices have huge trim limits, which can cause nasty latencies
when deleting files. Enable the admin to configure the size down.
We will look into having a more sane default instead of UINT_MAX
sectors.
- Improvement of the SGP gaps logic from Keith Busch.
- Enable the block core to handle arbitrarily sized bios, which
enables a nice simplification of bio_add_page() (which is an IO hot
path). From Kent.
- Improvements to the partition io stats accounting, making it
faster. From Ming Lei.
- Also from Ming Lei, a basic fixup for overflow of the sysfs pending
file in blk-mq, as well as a fix for a blk-mq timeout race
condition.
- Ming Lin has been carrying Kents above mentioned patches forward
for a while, and testing them. Ming also did a few fixes around
that.
- Sasha Levin found and fixed a use-after-free problem introduced by
the bio->bi_error changes from Christoph.
- Small blk cgroup cleanup from Viresh Kumar"
* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
blk: Fix bio_io_vec index when checking bvec gaps
block: Replace SG_GAPS with new queue limits mask
block: bump BLK_DEF_MAX_SECTORS to 2560
Revert "block: remove artifical max_hw_sectors cap"
blk-mq: fix race between timeout and freeing request
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
Documentation: update notes in biovecs about arbitrarily sized bios
block: remove bio_get_nr_vecs()
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: kill merge_bvec_fn() completely
md/raid5: get rid of bio_fits_rdev()
md/raid5: split bio for chunk_aligned_read
block: remove split code in blkdev_issue_{discard,write_same}
btrfs: remove bio splitting and merge_bvec_fn() calls
bcache: remove driver private bio splitting code
block: simplify bio_add_page()
block: make generic_make_request handle arbitrarily sized bios
blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
block: don't access bio->bi_error after bio_put()
block: shrink struct bio down to 2 cache lines again
...
num_tolerated_disk_barrier_failures in btrfs_balance
Code for updating fs_info->num_tolerated_disk_barrier_failures in
btrfs_balance() lacks raid56 support.
Reason:
Above code was wroten in 2012-08-01, together with
btrfs_calc_num_tolerated_disk_barrier_failures()'s first version.
Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated
later to support raid56, but code in btrfs_balance() was not
updated together.
Fix:
Merge above similar code to a common function:
btrfs_get_num_tolerated_disk_barrier_failures()
and make it support both case.
It can fix this bug with a bonus of cleanup, and make these code
never in above no-sync state from now on.
Suggested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
1: Use ARRAY_SIZE(types) to replace a static-value variant:
int num_types = 4;
2: Use 'continue' on condition to reduce one level tab
if (!XXX) {
code;
...
}
->
if (XXX)
continue;
code;
...
3: Put setting 'num_tolerated_disk_barrier_failures = 2' to
(num_tolerated_disk_barrier_failures > 2) condition to make
make logic neat.
if (num_tolerated_disk_barrier_failures > 0 && XXX)
num_tolerated_disk_barrier_failures = 0;
else if (num_tolerated_disk_barrier_failures > 1) {
if (XXX)
num_tolerated_disk_barrier_failures = 1;
else if (XXX)
num_tolerated_disk_barrier_failures = 2;
->
if (num_tolerated_disk_barrier_failures > 0 && XXX)
num_tolerated_disk_barrier_failures = 0;
if (num_tolerated_disk_barrier_failures > 1 && XXX)
num_tolerated_disk_barrier_failures = ;
if (num_tolerated_disk_barrier_failures > 2 && XXX)
num_tolerated_disk_barrier_failures = 2;
4: Remove comment of:
num_mirrors - 1: if RAID1 or RAID10 is configured and more
than 2 mirrors are used.
which is not fit with code.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
This attaches accounting information to bios as we submit them so the
new blkio controllers can throttle on btrfs filesystems.
Not much is required, we're just associating bios with blkcgs during clone,
calling wbc_init_bio()/wbc_account_io() during writepages submission,
and attaching the bios to the current context during direct IO.
Finally if we are splitting bios during btrfs_map_bio, this attaches
accounting information to the split.
The end result is able to throttle nicely on single disk filesystems. A
little more work is required for multi-device filesystems.
Signed-off-by: Chris Mason <clm@fb.com>
The code using 'ordered_extent_flush_mutex' mutex has removed by below
commit.
- 8d875f95da
btrfs: disable strict file flushes for renames and truncates
But the mutex still lives in struct 'btrfs_fs_info'.
So, this patch removes the mutex from struct 'btrfs_fs_info' and its
initialization code.
Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
When mount failed because missing device, we can see following
dmesg:
[ 1060.267743] BTRFS: too many missing devices, writeable mount is not allowed
[ 1060.273158] BTRFS: open_ctree failed
This patch add missing_device_number and tolerated_missing_device_number
to above output, to let user know what really happened, and helps
bug-report and debug.
dmesg after patch:
[ 127.050367] BTRFS: missing devices(1) exceeds the limit(0), writeable mount is not allowed
[ 127.056099] BTRFS: open_ctree failed
Changelog v1->v2:
1: Changed to more clear description, suggested-by:
Anand Jain <anand.jain@oracle.com>
Suggested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"Filipe fixed up a hard to trigger ENOSPC regression from our merge
window pull, and we have a few other smaller fixes"
* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix quick exhaustion of the system array in the superblock
btrfs: its btrfs_err() instead of btrfs_error()
btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
The cleaner thread may already be sleeping by the time we enter
close_ctree. If that's the case, we'll skip removing any unused
block groups queued for removal, even during a normal umount.
They'll be cleaned up automatically at next mount, but users
expect a umount to be a clean synchronization point, especially
when used on thin-provisioned storage with -odiscard. We also
explicitly remove unused block groups in the ro-remount path
for the same reason.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull btrfs fixes from Chris Mason:
"This is an assortment of fixes. Most of the commits are from Filipe
(fsync, the inode allocation cache and a few others). Mark kicked in
a series fixing corners in the extent sharing ioctls, and everyone
else fixed up on assorted other problems"
* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix wrong check for btrfs_force_chunk_alloc()
Btrfs: fix warning of bytes_may_use
Btrfs: fix hang when failing to submit bio of directIO
Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()
Btrfs: fix memory corruption on failure to submit bio for direct IO
btrfs: don't update mtime/ctime on deduped inodes
btrfs: allow dedupe of same inode
btrfs: fix deadlock with extent-same and readpage
btrfs: pass unaligned length to btrfs_cmp_data()
Btrfs: fix fsync after truncate when no_holes feature is enabled
Btrfs: fix fsync xattr loss in the fast fsync path
Btrfs: fix fsync data loss after append write
Btrfs: fix crash on close_ctree() if cleaner starts new transaction
Btrfs: fix race between caching kthread and returning inode to inode cache
Btrfs: use kmem_cache_free when freeing entry in inode cache
Btrfs: fix race between balance and unused block group deletion
btrfs: add error handling for scrub_workers_get()
btrfs: cleanup noused initialization of dev in btrfs_end_bio()
btrfs: qgroup: allow user to clear the limitation on qgroup
Pull btrfs updates from Chris Mason:
"Outside of our usual batch of fixes, this integrates the subvolume
quota updates that Qu Wenruo from Fujitsu has been working on for a
few releases now. He gets an extra gold star for making btrfs smaller
this time, and fixing a number of quota corners in the process.
Dave Sterba tested and integrated Anand Jain's sysfs improvements.
Outside of exporting a symbol (ack'd by Greg) these are all internal
to btrfs and it's mostly cleanups and fixes. Anand also attached some
of our sysfs objects to our internal device management structs instead
of an object off the super block. It will make device management
easier overall and it's a better fit for how the sysfs files are used.
None of the existing sysfs files are moved around.
Thanks for all the fixes everyone"
* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (87 commits)
btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref()
Btrfs: Check if kobject is initialized before put
lib: export symbol kobject_move()
Btrfs: sysfs: add support to show replacing target in the sysfs
Btrfs: free the stale device
Btrfs: use received_uuid of parent during send
Btrfs: fix use-after-free in btrfs_replay_log
btrfs: wait for delayed iputs on no space
btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup.
btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.
btrfs: ulist: Add ulist_del() function.
btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.
btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.
btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.
btrfs: qgroup: Switch rescan to new mechanism.
btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents().
btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots().
btrfs: qgroup: Add new function to record old_roots.
btrfs: qgroup: Record possible quota-related extent for qgroup.
btrfs: qgroup: Add function qgroup_update_counters().
...
@log_root_tree should not be referenced after kfree.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Chris Mason <clm@fb.com>
This patch replace the rbtree used in ref_head to list.
This has the following advantage:
1) Easier merge logic.
With the new list implement, we only need to care merging the tail
ref_node with the new ref_node.
And this can be done quite easy at insert time, no need to do a
indicated merge at run_delayed_refs().
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The return value of read_tree_block() can confuse callers as it always
returns NULL for either -ENOMEM or -EIO, so it's likely that callers
parse it to a wrong error, for instance, in btrfs_read_tree_root().
This fixes the above issue.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
kobject_unregister is to handle the release of the kobject,
its completion init is being called in btrfs_sysfs_add_one(),
so we don't have to do the same in the open_ctree() again.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
1) saving a bio's original bi_end_io
2) wiring up an intermediate bi_end_io
3) restoring the original bi_end_io from intermediate bi_end_io
4) calling bio_endio() to execute the restored original bi_end_io
The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called. For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0. When bio_inc_remaining() occurred during step
3 it brought it to a value of 2. When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).
Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining. For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.
Also, the bio_inc_remaining() interface has been moved local to bio.c.
Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Steps to reproduce:
while true; do
dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
rm /btrfs_dir/file
sync
done
And we'll see dd failed because btrfs return NO_SPACE.
Reason:
Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
in end to free fs space for next write, but sometimes it hadn't
done work on time, because btrfs-cleaner thread get delayed-iputs
from list before, but do iput() after next write.
This is log:
[ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
[ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
[ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
[ 2569.087554] comm=sync func=btrfs_commit_transaction() end
[ 2569.191081] comm=dd begin
[ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
[ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
[ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
...
[ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
[ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
Fix:
Make btrfs_commit_transaction() wait current running btrfs-cleaner's
delayed-iputs() done in end.
Test:
Use script similar to above(more complex),
before patch:
7 failed in 100 * 20 loop.
after patch:
0 failed in 100 * 20 loop.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Near the end of close_ctree, we're calling btrfs_free_block_rsv
to free up the orphan rsv. The problem is this call updates the
space_info, which has already been freed.
This adds a new __ function that directly calls kfree instead of trying
to update the space infos.
Signed-off-by: Chris Mason <clm@fb.com>
We loop through all of the dirty block groups during commit and write
the free space cache. In order to make sure the cache is currect, we do
this while no other writers are allowed in the commit.
If a large number of block groups are dirty, this can introduce long
stalls during the final stages of the commit, which can block new procs
trying to change the filesystem.
This commit changes the block group cache writeout to take appropriate
locks and allow it to run earlier in the commit. We'll still have to
redo some of the block groups, but it means we can get most of the work
out of the way without blocking the entire FS.
Signed-off-by: Chris Mason <clm@fb.com>
This was used to make sure that a fresh btrfs from an older mkfs.btrfs,
but it also allows us to mount a buggy btrfs if this btrfs has the right
superblock head part but has something wrong with chunk tree part[1], and
after that we can hit BUG_ON()s set in the code to prevent something
impossible.
Since David has released "Btrfs progs v3.19-rc2", just remove the check,
if anyone who wants to make a fresh btrfs, please use the latest one.
[1]: http://www.spinics.net/lists/linux-btrfs/msg42358.html
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"Most of these are fixing extent reservation accounting, or corners
with tree writeback during commit.
Josef's set does add a test, which isn't strictly a fix, but it'll
keep us from making this same mistake again"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix outstanding_extents accounting in DIO
Btrfs: add sanity test for outstanding_extents accounting
Btrfs: just free dummy extent buffers
Btrfs: account merges/splits properly
Btrfs: prepare block group cache before writing
Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
Btrfs: account for the correct number of extents for delalloc reservations
Btrfs: fix merge delalloc logic
Btrfs: fix comp_oper to get right order
Btrfs: catch transaction abortion after waiting for it
btrfs: fix sizeof format specifier in btrfs_check_super_valid()
This patch fixes mips compilation warning:
fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Chris Mason <clm@fb.com>
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional
overflow checks, the zeroing variant is kcalloc.
Signed-off-by: David Sterba <dsterba@suse.cz>
Switch to div_u64 if the divisor is a numeric constant or sum of
sizeof()s. We can remove a few instances of do_div that has the hidden
semtantics of changing the 1st argument.
Small power-of-two divisors are converted to bitshifts, large values are
kept intact for clarity.
Signed-off-by: David Sterba <dsterba@suse.cz>
Pull btrfs updates from Chris Mason:
"This pull is mostly cleanups and fixes:
- The raid5/6 cleanups from Zhao Lei fixup some long standing warts
in the code and add improvements on top of the scrubbing support
from 3.19.
- Josef has round one of our ENOSPC fixes coming from large btrfs
clusters here at FB.
- Dave Sterba continues a long series of cleanups (thanks Dave), and
Filipe continues hammering on corner cases in fsync and others
This all was held up a little trying to track down a use-after-free in
btrfs raid5/6. It's not clear yet if this is just made easier to
trigger with this pull or if its a new bug from the raid5/6 cleanups.
Dave Sterba is the only one to trigger it so far, but he has a
consistent way to reproduce, so we'll get it nailed shortly"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
Btrfs: don't remove extents and xattrs when logging new names
Btrfs: fix fsync data loss after adding hard link to inode
Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
Btrfs: account for large extents with enospc
Btrfs: don't set and clear delalloc for O_DIRECT writes
Btrfs: only adjust outstanding_extents when we do a short write
btrfs: Fix out-of-space bug
Btrfs: scrub, fix sleep in atomic context
Btrfs: fix scheduler warning when syncing log
Btrfs: Remove unnecessary placeholder in btrfs_err_code
btrfs: cleanup init for list in free-space-cache
btrfs: delete chunk allocation attemp when setting block group ro
btrfs: clear bio reference after submit_one_bio()
Btrfs: fix scrub race leading to use-after-free
Btrfs: add missing cleanup on sysfs init failure
Btrfs: fix race between transaction commit and empty block group removal
btrfs: add more checks to btrfs_read_sys_array
btrfs: cleanup, rename a few variables in btrfs_read_sys_array
btrfs: add checks for sys_chunk_array sizes
btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
...
Also, remove the two local variables create_uuid_tree
and check_uuid_tree; we can use the existence of
the uuid root and/or the RESCAN_UUID_TREE flag to
determine what action to take.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
close_ctree() has a local fs_info var for convienience;
use it consistently.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
The commit:
8dabb74 Btrfs: change core code of btrfs to support the
device replace operations
added the fs_info argument, but never used it -
just remove it again.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
This patch fixes mips compilation warning:
fs/btrfs/disk-io.c: In function 'btrfs_check_super_valid':
fs/btrfs/disk-io.c:3927:21: warning: format '%lu' expects argument
of type 'long unsigned int', but argument 3 has type 'unsigned int' [-Wformat]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: David Sterba <dsterba@suse.cz>
There are some op tables that can be easily made const, similarly the
sysfs feature and raid tables. This is motivated by PaX CONSTIFY plugin.
Signed-off-by: David Sterba <dsterba@suse.cz>
This is the 3rd independent patch of a larger project to cleanup btrfs's
internal usage of btrfs_root. Many functions take btrfs_root only to
grab the fs_info struct.
By requiring a root these functions cause programmer overhead. That
these functions can accept any valid root is not obvious until
inspection.
This patch reduces the specificity of such functions to accept the
fs_info directly.
These patches can be applied independently and thus are not being
submitted as a patch series. There should be about 26 patches by the
project's completion. Each patch will cleanup between 1 and 34 functions
apiece. Each patch covers a single file's functions.
This patch affects the following function(s):
1) csum_tree_block
2) csum_dirty_buffer
3) check_tree_block_fsid
4) btrfs_find_tree_block
5) clean_tree_block
Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Verify that possible minimum and maximum size is set, validity of
contents is checked in btrfs_read_sys_array.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
I received a few crafted images from Jiri, all got through the recently
added superblock checks. The lower bounds checks for num_devices and
sector/node -sizes were missing and caused a crash during mount.
Tools for symbolic code execution were used to prepare the images
contents.
Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
There isn't any real use of following members of struct btrfs_root
so delete them.
struct kobject root_kobj;
struct completion kobj_unregister;
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
This has been confusing people for too long, the message is really just
informative.
CC: <stable@vger.kernel.org> # 3.10+
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The errors are worth noting and might get missed with INFO level.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
All error conditions from open_ctree shall be ERR. Warning would
suggest that something's wrong and we can continue.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Several messages that point to some internal problem, level INFO is
wrong here.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Now that default_backing_dev_info is not used for writeback purposes we can
git rid of it easily:
- instead of using it's name for tracing unregistered bdi we just use
"unknown"
- btrfs and ceph can just assign the default read ahead window themselves
like several other filesystems already do.
- we can assign noop_backing_dev_info as the default one in alloc_super.
All filesystems already either assigned their own or
noop_backing_dev_info.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since "BDI: Provide backing device capability information [try #3]" the
backing_dev_info structure also provides flags for the kind of mmap
operation available in a nommu environment, which is entirely unrelated
to it's original purpose.
Introduce a new nommu-only file operation to provide this information to
the nommu mmap code instead. Splitting this from the backing_dev_info
structure allows to remove lots of backing_dev_info instance that aren't
otherwise needed, and entirely gets rid of the concept of providing a
backing_dev_info for a character device. It also removes the need for
the mtd_inodefs filesystem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper
for simple btrfs_search_slot.
After we've removed all such callers, passing a NULL key is not valid
anymore.
Signed-off-by: David Sterba <dsterba@suse.cz>
If btrfs_find_item is called with NULL path it allocates one locally but
does not free it. Affected paths are inserting an orphan item for a file
and for a subvol root.
Move the path allocation to the callers.
CC: <stable@vger.kernel.org> # 3.14+
Fixes: 3f870c2899 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality")
Signed-off-by: David Sterba <dsterba@suse.cz>
Finally it's clear that the requested blocksize is always equal to
nodesize, with one exception, the superblock.
Superblock has fixed size regardless of the metadata block size, but
uses the same helpers to initialize sys array/chunk tree and to work
with the chunk items. So it pretends to be an extent_buffer for a
moment, btrfs_read_sys_array is full of special cases, we're adding one
more.
Signed-off-by: David Sterba <dsterba@suse.cz>
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.
This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.
I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:
[PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
The corruption always happened when a transaction aborted and then fsck complained
like this:
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
*** fsck.btrfs output ***
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
read block failed check_tree_block
Couldn't open file system
In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:
1) transaction N started, fs_info->pinned_extents pointed to
fs_info->freed_extents[0];
2) node/eb 94945280 is created;
3) eb is persisted to disk;
4) transaction N commit starts, fs_info->pinned_extents now points to
fs_info->freed_extents[1], and transaction N completes;
5) transaction N + 1 starts;
6) eb is COWed, and btrfs_free_tree_block() called for this eb;
7) eb range (94945280 to 94945280 + 16Kb) is added to
fs_info->pinned_extents (fs_info->freed_extents[1]);
8) Something goes wrong in transaction N + 1, like hitting ENOSPC
for example, and the transaction is aborted, turning the fs into
readonly mode. The stack trace I got for example:
[112065.253935] [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
[112065.254271] [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
[112065.254567] [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.261674] [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
[112065.261922] [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
[112065.262211] [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.262545] [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
[112065.262771] [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
[112065.263105] [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
(...)
[112065.264493] ---[ end trace dd7903a975a31a08 ]---
[112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
[112065.264997] BTRFS info (device sdc): forced readonly
9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
turn calls btrfs_destroy_pinned_extent();
10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
marked as dirty in fs_info->freed_extents[], and for each one
it calls discard, if the fs was mounted with "-o discard", and
adds the range to the free space cache of the respective block
group;
11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
sees the free space entries and performs a discard;
12) After an umount and mount (or fsck), our eb's location on disk was full
of zeroes, and it should have been untouched, because it was marked as
dirty in the fs_info->pinned_extents tree, and therefore used by the
trees that the last committed superblock points to.
Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.
This isn't a new problem, as it's been present since 2011 (git commit
acce952b02).
Cc: stable@vger.kernel.org # any kernel released after 2011-01-06
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described. It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits. Consider
the following
write extent 0-4k
log extent in log tree
commit transaction
< power fail happens here
ordered extent completes
We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed. If we lose
power before the transaction commit we are save, otherwise we are not.
Fix this by keeping track of all extents we logged in this transaction. Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding. This will make sure that if we lose
power after the transaction commit we still have our data. This also fixes the
problem of the improperly updated extent generation. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Copy&paste errors in some messages and add few more missing macro
accessors.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The pending mount option(s) now share namespace and bits with the normal
options, and the existing one for (inode_cache) is unset unconditionally
at each transaction commit.
Introduce a separate namespace for pending changes and enhance the
descriptions of the intended change to use separate bits for each
action.
Signed-off-by: David Sterba <dsterba@suse.cz>
There are some actions that modify global filesystem state but cannot be
performed at the time of request, but later at the transaction commit
time when the filesystem is in a known state.
For example enabling new incompat features on-the-fly or issuing
transaction commit from unsafe contexts (sysfs handlers).
Signed-off-by: David Sterba <dsterba@suse.cz>
Pull btrfs fixes from Chris Mason:
"Filipe is nailing down some problems with our skinny extent variation,
and Dave's patch fixes endian problems in the new super block checks"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
Btrfs: properly clean up btrfs_end_io_wq_cache
Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
btrfs: use macro accessors in superblock validation checks
The initial patch c926093ec5 (btrfs: add more superblock checks)
did not properly use the macro accessors that wrap endianness and the
code would not work correctly on big endian machines.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Pull core block layer changes from Jens Axboe:
"This is the core block IO pull request for 3.18. Apart from the new
and improved flush machinery for blk-mq, this is all mostly bug fixes
and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the
->queue_rq() hook for blk-mq instead, freeing up one of the request
bits. The space was overly tight on 32-bit, so Martin also killed
REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we
have a per hardware context flush request, which both cleans up the
code should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues
where we NULL deref when a device goes bad or missing. From Joe
Lawrence.
- Prep work for drastically reducing the memory consumption of dm
devices from Junichi Nomura. This allows creating clone bio sets
without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and
hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump
scenarios and drivers that use crazy high depths (certain SCSI
shared tag setups). We now just use a single queue and limited
depth for that"
* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
block: Remove REQ_KERNEL
blk-mq: allocate cpumask on the home node
bio-integrity: remove the needless fail handle of bip_slab creating
block: include func name in __get_request prints
block: make blk_update_request print prefix match ratelimited prefix
blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
block: fix alignment_offset math that assumes io_min is a power-of-2
blk-mq: Make bt_clear_tag() easier to read
blk-mq: fix potential hang if rolling wakeup depth is too high
block: add bioset_create_nobvec()
block: use bio_clone_fast() in blk_rq_prep_clone()
block: misplaced rq_complete tracepoint
sd: Honor block layer integrity handling flags
block: Replace strnicmp with strncasecmp
block: Add T10 Protection Information functions
block: Don't merge requests if integrity flags differ
block: Integrity checksum flag
block: Relocate bio integrity flags
block: Add a disk flag to block integrity profile
block: Add prefix to block integrity profile flags
...
Pull btrfs updates from Chris Mason:
"The largest set of changes here come from Miao Xie. He's cleaning up
and improving read recovery/repair for raid, and has a number of
related fixes.
I've merged another set of fsync fixes from Filipe, and he's also
improved the way we handle metadata write errors to make sure we force
the FS readonly if things go wrong.
Otherwise we have a collection of fixes and cleanups. Dave Sterba
gets a cookie for removing the most lines (thanks Dave)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (139 commits)
btrfs: Fix compile error when CONFIG_SECURITY is not set.
Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
btrfs: Make btrfs handle security mount options internally to avoid losing security label.
Btrfs: send, don't delay dir move if there's a new parent inode
btrfs: add more superblock checks
Btrfs: fix race in WAIT_SYNC ioctl
Btrfs: be aware of btree inode write errors to avoid fs corruption
Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
btrfs: fix shadow warning on cmp
Btrfs: fix compilation errors under DEBUG
Btrfs: fix crash of btrfs_release_extent_buffer_page
Btrfs: add missing end_page_writeback on submit_extent_page failure
btrfs: Fix the wrong condition judgment about subset extent map
Btrfs: fix build_backref_tree issue with multiple shared blocks
Btrfs: cleanup error handling in build_backref_tree
btrfs: move checks for DUMMY_ROOT into a helper
btrfs: new define for the inline extent data start
btrfs: kill extent_buffer_page helper
btrfs: drop constant param from btrfs_release_extent_buffer_page
btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
...
Populate btrfs_check_super_valid() with checks that try to verify
consistency of superblock by additional conditions that may arise from
corrupted devices or bitflips. Some of tests are only hints and issue
warnings instead of failing the mount, basically when the checks are
derived from the data found in the superblock.
Tested on a broken image provided by Qu.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
8MiB is way too large and likely set by mistake. This is not
a significant issue as in practice the max amount of data
added to an inline extent is also limited by the page cache
and btree leaf sizes.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free +
_tree_block family. The parameter blocksize was set to the metadata
block size, directly or indirectly.
Signed-off-by: David Sterba <dsterba@suse.cz>
The parent_transid parameter has been unused since its introduction in
ca7a79ad8d ("Pass down the expected generation number when reading
tree blocks"). In reada_tree_block, it was even wrongly set to leafsize.
Transid check is done in the proper read and readahead ignores errors.
Signed-off-by: David Sterba <dsterba@suse.cz>
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The
commit reverted and patches to implement proper fix will be added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space. This happens because all the chunks were allocated for
data since the metadata requirements were so low. But now there's a bunch of
empty data block groups and not enough metadata space to do anything. This
patch solves this problem by automatically deleting empty block groups. If we
notice the used count go down to 0 when deleting or on mount notice that a block
group has a used count of 0 then we will queue it to be deleted.
When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it. This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch implement data repair function when direct read fails.
The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
mirror.
- When the io on the mirror ends, we will insert the endio work into the
dedicated btrfs workqueue, not common read endio workqueue, because the
original endio work is still blocked in the btrfs endio workqueue, if we
insert the endio work of the io on the mirror into that workqueue, deadlock
would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
device->bytes_used will be changed when allocating a new chunk, and
disk_total_size will be changed if resizing is successful.
Meanwhile, the on-disk super blocks of the previous transaction
might not be updated. Considering the consistency of the metadata
in the previous transaction, We should use the size in the previous
transaction to check if the super block is beyond the boundary
of the device.
Though it is not big problem because we don't use it now, but anyway
it is better that we make it be consistent with the common metadata,
maybe we will use it in the future.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
total_size will be changed when resizing a device, and disk_total_size
will be changed if resizing is successful. Meanwhile, the on-disk super
blocks of the previous transaction might not be updated. Considering
the consistency of the metadata in the previous transaction, We should
use the size in the previous transaction to check if the super block is
beyond the boundary of the device. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
This comments became wrong after c3c532[bdi: add helper function for
doing init and register of a bdi for a file system], so remove them.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
The issue was introduced in a79b7d4b3e,
adding allocation of extent_workers, so this stray check is surely not
meant to be a check of something else.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=82021
Reported-by: Maks Naumov <maksqwe1@ukr.net>
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Marc argued that if there are several btrfs filesystems mounted,
while users even don't know which filesystem hit the corrupted
errors something like generation verification failure.
Since @extent_buffer structure has a member @fs_info, let's output
btrfs device info.
Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The nodesize and leafsize were never of different values. Unify the
usage and make nodesize the one. Cleanup the redundant checks and
helpers.
Shaves a few bytes from .text:
text data bss dec hex filename
852418 24560 23112 900090 dbbfa btrfs.ko.before
851074 24584 23112 898770 db6d2 btrfs.ko.after
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
There's no user of the return value and we can get rid of the comment in
put_super.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
The naming is confusing, generic yet used for a specific cache. Add a
prefix 'ino_' or rename appropriately.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
A bit of churn on the for-linus side that would be nice to have
in the core bits for 3.18, so pull it in to catch us up and make
forward progress easier.
Signed-off-by: Jens Axboe <axboe@fb.com>
Conflicts:
block/scsi_ioctl.c
bdev_get_queue() returns the request_queue associated with the
specified block_device. blk_get_backing_dev_info() makes use of
bdev_get_queue() to determine the associated bdi given a block_device.
All the callers of bdev_get_queue() including
blk_get_backing_dev_info() assume that bdev_get_queue() may return
NULL and implement NULL handling; however, bdev_get_queue() requires
the passed in block_device is opened and attached to its gendisk.
Because an active gendisk always has a valid request_queue associated
with it, bdev_get_queue() can never return NULL and neither can
blk_get_backing_dev_info().
Make it clear that neither of the two functions can return NULL and
remove NULL handling from all the callers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Percpu allocator now supports allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.
We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion. This is the one with
the most users. Other percpu data structures are a lot easier to
convert.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.
Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --
normal_work_helper(arg)
work = container_of(arg, struct btrfs_work, normal_work);
work->func() <---- (we name it work X)
for ordered_work in wq->ordered_list
ordered_work->ordered_func()
ordered_work->ordered_free()
The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will
file a readahead request
btrfs_readpages()
for page that is not in page cache
__do_readpage()
submit_extent_page()
btrfs_submit_bio_hook()
btrfs_bio_wq_end_io()
submit_bio()
end_workqueue_bio() <--(ret by the 1st endio)
queue a work(named work Y) for the 2nd
also the real endio()
So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.
A bit more explanation,
A,B,C -- struct btrfs_work
arg -- struct work_struct
kthread:
worker_thread()
pick up a work_struct from @worklist
process_one_work(arg)
worker->current_work = arg; <-- arg is A->normal_work
worker->current_func(arg)
normal_work_helper(arg)
A = container_of(arg, struct btrfs_work, normal_work);
A->func()
A->ordered_func()
A->ordered_free() <-- A gets freed
B->ordered_func()
submit_compressed_extents()
find_free_extent()
load_free_space_inode()
... <-- (the above readhead stack)
end_workqueue_bio()
btrfs_queue_work(work C)
B->ordered_free()
As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.
Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).
When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.
So the situation is that our kthread is waiting forever on work C.
Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.
With this patch, I no long hit the above hang.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
total_bytes of device is just a in-memory variant which is used to record
the size of the device, and it might be changed before we resize a device,
if the resize operation fails, it will be fallbacked. But some code used it
to update on-disk metadata of the device, it would cause the problem that
on-disk metadata of the devices was not consistent. We should use the other
variant named disk_total_bytes to update the on-disk metadata of device,
because that variant is updated only when the resize operation is successful.
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Truncates and renames are often used to replace old versions of a file
with new versions. Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.
Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.
This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held. It's rare, but these things can
deadlock.
This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.
Signed-off-by: Chris Mason <clm@fb.com>
Balance recovery is called when RW mounting or remounting from
RO to RW, it is called to finish roots merging.
When doing balance recovery, relocation root's corresponding
fs root(whose root refs is 0) might be destroyed by cleaner
thread, this will make btrfs fail to mount.
Fix this problem by holding @cleaner_mutex when doing balance
recovery.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is a regression from my patch a26e8c9f75, we
need to only unlock the block if we were the one who locked it. Otherwise this
will trip BUG_ON()'s in locking.c Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Delayed extent operations are triggered during transaction commits.
The goal is to queue up a healthly batch of changes to the extent
allocation tree and run through them in bulk.
This farms them off to async helper threads. The goal is to have the
bulk of the delayed operations being done in the background, but this is
also important to limit our stack footprint.
Signed-off-by: Chris Mason <clm@fb.com>
This exercises the various parts of the new qgroup accounting code. We do some
basic stuff and do some things with the shared refs to make sure all that code
works. I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently qgroups account for space by intercepting delayed ref updates to fs
trees. It does this by adding sequence numbers to delayed ref updates so that
it can figure out how the tree looked before the update so we can adjust the
counters properly. The problem with this is that it does not allow delayed refs
to be merged, so if you say are defragging an extent with 5k snapshots pointing
to it we will thrash the delayed ref lock because we need to go back and
manually merge these things together. Instead we want to process quota changes
when we know they are going to happen, like when we first allocate an extent, we
free a reference for an extent, we add new references etc. This patch
accomplishes this by only adding qgroup operations for real ref changes. We
only modify the sequence number when we need to lookup roots for bytenrs, this
reduces the amount of churn on the sequence number and allows us to merge
delayed refs as we add them most of the time. This patch encompasses a bunch of
architectural changes
1) qgroup ref operations: instead of tracking qgroup operations through the
delayed refs we simply add new ref operations whenever we notice that we need to
when we've modified the refs themselves.
2) tree mod seq: we no longer have this separation of major/minor counters.
this makes the sequence number stuff much more sane and we can remove some
locking that was needed to protect the counter.
3) delayed ref seq: we now read the tree mod seq number and use that as our
sequence. This means each new delayed ref doesn't have it's own unique sequence
number, rather whenever we go to lookup backrefs we inc the sequence number so
we can make sure to keep any new operations from screwing up our world view at
that given point. This allows us to merge delayed refs during runtime.
With all of these changes the delayed ref stuff is a little saner and the qgroup
accounting stuff no longer goes negative in some cases like it was before.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
To ease finding bugs during development related to modifying btree leaves
in such a way that it makes its items not sorted by key anymore. Since this
is an expensive check, it's only enabled if CONFIG_BTRFS_FS_CHECK_INTEGRITY
is set, which isn't meant to be enabled for regular users.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
In close_ctree(), after we have stopped all workers,there maybe still
some read requests(for example readahead) to submit and this *maybe* trigger
an oops that user reported before:
kernel BUG at fs/btrfs/async-thread.c:619!
By hacking codes, i can reproduce this problem with one cpu available.
We fix this potential problem by invalidating all btree inode pages before
stopping all workers.
Thanks to Miao for pointing out this problem.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
In btrfs_create_tree(), if btrfs_insert_root() fails, we should
free root->commit_root.
Reported-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Current btrfs_orphan_cleanup will also cleanup roots which is already in
fs_info->dead_roots without protection.
This will have conditional race with fs_info->cleaner_kthread.
This patch will use refs in root->root_item to detect roots in
dead_roots and avoid conflicts.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And When the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. At some cases, they would be blocked for
a long time, it made the performance fluctuate wildly.
So we introduce the background metadata space reclamation, when the space
is about to be exhausted, we insert a reclaim work into the workqueue, the
worker of the workqueue helps us to reclaim the reserved space at the
background. By this way, the tasks needn't reclaim the space by themselves at
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
Here is my test result(Tested by compilebench):
Memory: 2GB
CPU: 2Cores * 1CPU
Partition: 40GB(SSD)
Test command:
# compilebench -D <mnt> -m
Without this patch:
intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
With this patch:
intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: limit the path size in send to PATH_MAX
Btrfs: correctly set profile flags on seqlock retry
Btrfs: use correct key when repeating search for extent item
Btrfs: fix inode caching vs tree log
Btrfs: fix possible memory leaks in open_ctree()
Btrfs: avoid triggering bug_on() when we fail to start inode caching task
Btrfs: move btrfs_{set,clear}_and_info() to ctree.h
btrfs: replace error code from btrfs_drop_extents
btrfs: Change the hole range to a more accurate value.
btrfs: fix use-after-free in mount_subvol()
Fix possible memory leaks in the following error handling paths:
read_tree_block()
btrfs_recover_log_trees
btrfs_commit_super()
btrfs_find_orphan_roots()
btrfs_cleanup_fs_roots()
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull second set of btrfs updates from Chris Mason:
"The most important changes here are from Josef, fixing a btrfs
regression in 3.14 that can cause corruptions in the extent allocation
tree when snapshots are in use.
Josef also fixed some deadlocks in send/recv and other assorted races
when balance is running"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
Btrfs: fix compile warnings on on avr32 platform
btrfs: allow mounting btrfs subvolumes with different ro/rw options
btrfs: export global block reserve size as space_info
btrfs: fix crash in remount(thread_pool=) case
Btrfs: abort the transaction when we don't find our extent ref
Btrfs: fix EINVAL checks in btrfs_clone
Btrfs: fix unlock in __start_delalloc_inodes()
Btrfs: scrub raid56 stripes in the right way
Btrfs: don't compress for a small write
Btrfs: more efficient io tree navigation on wait_extent_bit
Btrfs: send, build path string only once in send_hole
btrfs: filter invalid arg for btrfs resize
Btrfs: send, fix data corruption due to incorrect hole detection
Btrfs: kmalloc() doesn't return an ERR_PTR
Btrfs: fix snapshot vs nocow writting
btrfs: Change the expanding write sequence to fix snapshot related bug.
btrfs: make device scan less noisy
btrfs: fix lockdep warning with reclaim lock inversion
Btrfs: hold the commit_root_sem when getting the commit root during send
Btrfs: remove transaction from send
...
Lets try this again. We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send. So fix this by making sure looking at the
commit roots is always going to be consistent. We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once. Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes. With this patch we no longer
deadlock and pass all the weird send/receive corner cases. Thanks,
Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel. This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed. Turns out this is because of a
bad interaction of balance and send/receive. Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening. So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid. But because we think it is invalid we clear uptodate and
re-read in the block. If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.
Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data. So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs changes from Chris Mason:
"This is a pretty long stream of bug fixes and performance fixes.
Qu Wenruo has replaced the btrfs async threads with regular kernel
workqueues. We'll keep an eye out for performance differences, but
it's nice to be using more generic code for this.
We still have some corruption fixes and other patches coming in for
the merge window, but this batch is tested and ready to go"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
Btrfs: fix a crash of clone with inline extents's split
btrfs: fix uninit variable warning
Btrfs: take into account total references when doing backref lookup
Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
Btrfs: fix incremental send's decision to delay a dir move/rename
Btrfs: remove unnecessary inode generation lookup in send
Btrfs: fix race when updating existing ref head
btrfs: Add trace for btrfs_workqueue alloc/destroy
Btrfs: less fs tree lock contention when using autodefrag
Btrfs: return EPERM when deleting a default subvolume
Btrfs: add missing kfree in btrfs_destroy_workqueue
Btrfs: cache extent states in defrag code path
Btrfs: fix deadlock with nested trans handles
Btrfs: fix possible empty list access when flushing the delalloc inodes
Btrfs: split the global ordered extents mutex
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
Btrfs: reclaim delalloc metadata more aggressively
Btrfs: remove unnecessary lock in may_commit_transaction()
Btrfs: remove the unnecessary flush when preparing the pages
Btrfs: just do dirty page flush for the inode with compression before direct IO
...
We didn't have a lock to protect the access to the delalloc inodes list, that is
we might access a empty delalloc inodes list if someone start flushing delalloc
inodes because the delalloc inodes were moved into a other list temporarily.
Fix it by wrapping the access with a lock.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
When we create a snapshot, we just need wait the ordered extents in
the source fs/file root, but because we use the global mutex to protect
this ordered extents list of the source fs/file root to avoid accessing
a empty list, if someone got the mutex to access the ordered extents list
of the other fs/file root, we had to wait.
This patch splits the above global mutex, now every fs/file root has
its own mutex to protect its own list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
If the snapshot creation happened after the nocow write but before the dirty
data flush, we would fail to flush the dirty data because of no space.
So we must keep track of when those nocow write operations start and when they
end, if there are nocow writers, the snapshot creators must wait. In order
to implement this function, I introduce btrfs_{start, end}_nocow_write(),
which is similar to mnt_{want,drop}_write().
These two functions are only used for nocow file write operations.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Since the "_struct" suffix is mainly used for distinguish the differnt
btrfs_work between the original and the newly created one,
there is no need using the suffix since all btrfs_workers are changed
into btrfs_workqueue.
Also this patch fixed some codes whose code style is changed due to the
too long "_struct" suffix.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Since all the btrfs_worker is replaced with the newly created
btrfs_workqueue, the old codes can be easily remove.
Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->qgroup_rescan_worker with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->delayed_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->fixup_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->readahead_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->cache_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->rmw_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->endio_* workqueues with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Replace the fs_info->submit_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Much like the fs_info->workers, replace the fs_info->submit_workers
use the same btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Much like the fs_info->workers, replace the fs_info->delalloc_workers
use the same btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Use the newly created btrfs_workqueue_struct to replace the original
fs_info->workers
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
We might commit the log sub-transaction which didn't contain the metadata we
logged. It was because we didn't record the log transid and just select
the current log sub-transaction to commit, but the right one might be
committed by the other task already. Actually, we needn't do anything
and it is safe that we go back directly in this case.
This patch improves the log sync by the above idea. We record the transid
of the log sub-transaction in which we log the metadata, and the transid
of the log sub-transaction we have committed. If the committed transid
is >= the transid we record when logging the metadata, we just go back.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
It is possible that many tasks sync the log tree at the same time, but
only one task can do the sync work, the others will wait for it. But those
wait tasks didn't get the result of the log sync, and returned 0 when they
ended the wait. It caused those tasks skipped the error handle, and the
serious problem was they told the users the file sync succeeded but in
fact they failed.
This patch fixes this problem by introducing a log context structure,
we insert it into the a global list. When the sync fails, we will set
the error number of every log context in the list, then the waiting tasks
get the error number of the log context and handle the error if need.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
I got an error on v3.13:
BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.)
how to reproduce:
> mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2
> wipefs -a /dev/sdf2
> mount -o degraded /dev/sdf1 /mnt
> btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt
The reason of the error is that barrier_all_devices() failed to submit
barrier to the missing device. However it is clear that we cannot do
anything on missing device, and also it is not necessary to care chunks
on the missing device.
This patch stops sending/waiting barrier if device is missing.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
During device replace test, we hit a null pointer deference (It was very easy
to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree. And we forgot to replace the source device in those
mapping of the new chunks.
- We might get the mapping information which including the source device
before the mapping information update. And then submit the bio which was
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing mapping tree update and source
device remove in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, the above method can avoid
the new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.
For the second bug, we need make sure all flighting bios are finished and
no new bios are produced during we are removing the source device. To fix
this problem, we introduced a global @bio_counter, we not only inc/dec
@bio_counter outsize of map_blocks, but also inc it before submitting bio
and dec @bio_counter when ending bios.
Since Raid56 is a little different and device replace dosen't support raid56
yet, it is not addressed in the patch and I add comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Pull btrfs fixes from Chris Mason:
"We have a small collection of fixes in my for-linus branch.
The big thing that stands out is a revert of a new ioctl. Users
haven't shipped yet in btrfs-progs, and Dave Sterba found a better way
to export the information"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use right clone root offset for compressed extents
btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105
Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol
Btrfs: fix max_inline mount option
Btrfs: fix a lockdep warning when cleaning up aborted transaction
Revert "btrfs: add ioctl to export size of global metadata reservation"
Given now we have 2 spinlock for management of delayed refs,
CONFIG_DEBUG_SPINLOCK=y helped me find this,
[ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258
[ 4723.414882] lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2
[ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G W O 3.12.0+ #4
[ 4723.421321] Call Trace:
[ 4723.421872] [<ffffffff81680fe7>] dump_stack+0x54/0x74
[ 4723.422753] [<ffffffff81681093>] spin_dump+0x8c/0x91
[ 4723.424979] [<ffffffff816810b9>] spin_bug+0x21/0x26
[ 4723.425846] [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90
[ 4723.434424] [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40
[ 4723.438747] [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs]
[ 4723.443321] [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs]
[ 4723.444692] [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[ 4723.450336] [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10
[ 4723.451332] [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs]
[ 4723.452543] [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
[ 4723.457833] [<ffffffff81079efa>] kthread+0xea/0xf0
[ 4723.458990] [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
[ 4723.460133] [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
[ 4723.460865] [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
[ 4723.496521] ------------[ cut here ]------------
----------------------------------------------------------------------
The reason is that we get to call cond_resched_lock(&head_ref->lock) while
still holding @delayed_refs->lock.
So it's different with __btrfs_run_delayed_refs(), where we do drop-acquire
dance before and after actually processing delayed refs.
Here we don't drop the lock, others are not able to add new delayed refs to
head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"Filipe is fixing compile and boot problems with our crc32c rework, and
Josef has disabled snapshot aware defrag for now.
As the number of snapshots increases, we're hitting OOM. For the
short term we're disabling things until a bigger fix is ready"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use late_initcall instead of module_init
Btrfs: use btrfs_crc32c everywhere instead of libcrc32c
Btrfs: disable snapshot aware defrag for now
After the commit titled "Btrfs: fix btrfs boot when compiled as built-in",
LIBCRC32C requirement was removed from btrfs' Kconfig. This made it not
possible to build a kernel with btrfs enabled (either as module or built-in)
if libcrc32c is not enabled as well. So just replace all uses of libcrc32c
with the equivalent function in btrfs hash.h - btrfs_crc32c.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs updates from Chris Mason:
"This is a pretty big pull, and most of these changes have been
floating in btrfs-next for a long time. Filipe's properties work is a
cool building block for inheriting attributes like compression down on
a per inode basis.
Jeff Mahoney kicked in code to export filesystem info into sysfs.
Otherwise, lots of performance improvements, cleanups and bug fixes.
Looks like there are still a few other small pending incrementals, but
I wanted to get the bulk of this in first"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits)
Btrfs: fix spin_unlock in check_ref_cleanup
Btrfs: setup inode location during btrfs_init_inode_locked
Btrfs: don't use ram_bytes for uncompressed inline items
Btrfs: fix btrfs_search_slot_for_read backwards iteration
Btrfs: do not export ulist functions
Btrfs: rework ulist with list+rb_tree
Btrfs: fix memory leaks on walking backrefs failure
Btrfs: fix send file hole detection leading to data corruption
Btrfs: add a reschedule point in btrfs_find_all_roots()
Btrfs: make send's file extent item search more efficient
Btrfs: fix to catch all errors when resolving indirect ref
Btrfs: fix protection between walking backrefs and root deletion
btrfs: fix warning while merging two adjacent extents
Btrfs: fix infinite path build loops in incremental send
btrfs: undo sysfs when open_ctree() fails
Btrfs: fix snprintf usage by send's gen_unique_name
btrfs: fix defrag 32-bit integer overflow
btrfs: sysfs: list the NO_HOLES feature
btrfs: sysfs: don't show reserved incompat feature
btrfs: call permission checks earlier in ioctls and return EPERM
...
When transaction is aborted, we fail to commit transaction, instead we do
cleanup work. After that when we umount btrfs, we get to free fs roots' log
trees respectively, but that happens after we unpin extents, so those extents
pinned by freeing log trees will remain in memory and lead to the leak.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Add noinode_cache mount option for btrfs.
Since inode map cache involves all the btrfs_find_free_ino/return_ino
things and if just trigger the mount_opt,
an inode number get from inode map cache will not returned to inode map
cache.
To keep the find and return inode both in the same behavior,
a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea.
CHANGE_INODE_CACHE is set/cleared in remounting, and the original
INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
success transaction.
Since find/return inode is all done between btrfs_start_transaction and
btrfs_commit_transaction, this will keep consistent behavior.
Also noinode_cache mount option will not stop the caching_kthread.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
On one of our gluster clusters we noticed some pretty big lag spikes. This
turned out to be because our transaction commit was taking like 3 minutes to
complete. This is because we have like 30 gigs of metadata, so our global
reserve would end up being the max which is like 512 mb. So our throttling code
would allow a ridiculous amount of delayed refs to build up and then they'd all
get run at transaction commit time, and for a cold mounted file system that
could take up to 3 minutes to run. So fix the throttling to be based on both
the size of the global reserve and how long it takes us to run delayed refs.
This patch tracks the time it takes to run delayed refs and then only allows 1
seconds worth of outstanding delayed refs at a time. This way it will auto-tune
itself from cold cache up to when everything is in memory and it no longer has
to go to disk. This makes our transaction commits take much less time to run.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Currently we have two rb-trees, one for delayed ref heads and one for all of the
delayed refs, including the delayed ref heads. When we process the delayed refs
we have to hold onto the delayed ref lock for all of the selecting and merging
and such, which results in quite a bit of lock contention. This was solved by
having a waitqueue and only one flusher at a time, however this hurts if we get
a lot of delayed refs queued up.
So instead just have an rb tree for the delayed ref heads, and then attach the
delayed ref updates to an rb tree that is per delayed ref head. Then we only
need to take the delayed ref lock when adding new delayed refs and when
selecting a delayed ref head to process, all the rest of the time we deal with a
per delayed ref head lock which will be much less contentious.
The locking rules for this get a little more complicated since we have to lock
up to 3 things to properly process delayed refs, but I will address that problem
later. For now this passes all of xfstests and my overnight stress tests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
We only intent to fua the first superblock in every device from
comments, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Convert all applicable cases of printk and pr_* to the btrfs_* macros.
Fix all uses of the BTRFS prefix.
Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
I need to create a fake tree to test qgroups and I don't want to have to setup a
fake btree_inode. The fact is we only use the radix tree for the fs_info, so
everybody else who allocates an extent_io_tree is just wasting the space anyway.
This patch moves the radix tree and its lock into btrfs_fs_info so there is less
stuff I have to fake to do qgroup sanity tests. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is the third step in bootstrapping the btrfs_find_item interface.
The function find_orphan_item(), in orphan.c, is similar to the two
functions already replaced by the new interface. It uses two parameters,
which are already present in the interface, and is nearly identical to
the function brought in in the previous patch.
Replace the two calls to find_orphan_item() with calls to
btrfs_find_item(), with the defined objectid and type that was used
internally by find_orphan_item(), a null path, and a null key. Add a
test for a null path to btrfs_find_item, and if it passes, allocate and
free the path. Finally, remove find_orphan_item().
Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
Remove unused variables:
* tree from csum_dirty_buffer,
* tree from btree_readpage_end_io_hook,
* tree from btree_writepages,
* bytenr from btrfs_create_tree,
* fs_info from end_workqueue_fn.
Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch adds per-super attributes to sysfs.
It doesn't publish any attributes yet, but does the proper lifetime
handling as well as the basic infrastructure to add new attributes.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
The way how we process delayed refs is
1) get a bunch of head refs,
2) pick up one head ref,
3) go one node back for any delayed ref updates.
The head ref is also linked in the same rbtree as the delayed ref is,
so in 1) stage, we have to walk one by one including not only head refs, but
delayed refs.
When we have a great number of delayed refs pending to process,
this'll cost time a lot.
Here we introduce a head ref specific rbtree, it only has head refs, so troubles
go away.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
btrfs bits got lost in the rebase
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With immutable biovecs we don't want code accessing bi_io_vec directly -
the uses this patch changes weren't incorrect since they all own the
bio, but it makes the code harder to audit for no good reason - also,
this will help with multipage bvecs later.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
The 'git blame' history shows that, the old transaction commit code has to do
twice to ensure roots are updated and we have to flush metadata and super block
manually, however, right now all of these can be handled well inside
the transaction commit code without extra efforts.
And the error handling part remains same with the current code, -- 'return to
caller once we get error'.
This saves us a transaction commit and a flush of super block, which are both
heavy operations according to ftrace output analysis.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The function write_ctree_super() in disk-io.c uses variable ret to return
the result of function write_all_supers(). Since, this variable serves
no purpose, hence the patch removes it and returns the call of the
called function.
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.
Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The function free_root_pointers() in disk-io.h contains redundant code.
Therefore, this patch adds a helper function free_root_extent_buffers()
to free_root_pointers() to eliminate redundancy.
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Originally, we introduced scrub_super_lock to synchronize
tree log code with scrubbing super.
However we can replace scrub_super_lock with device_list_mutex,
because writing super will hold this mutex, this will reduce an extra
lock holding when writing supers in sync log code.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
and inc_nlink(). This doesn't belong in mainline.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
alloc_extent_buffer() uses radix_tree_lookup() when radix_tree_insert()
fails with EEXIST. That part of the code is very similar to the code in
find_extent_buffer(). This patch replaces radix_tree_lookup() and
surrounding code in alloc_extent_buffer() with find_extent_buffer().
Note that radix_tree_lookup() does not need to be protected by
tree->buffer_lock. It is protected by eb->refs.
While at it, this patch
- changes the other usage of radix_tree_lookup() in alloc_extent_buffer()
with find_extent_buffer() to reduce redundancy.
- removes the unused argument 'len' to find_extent_buffer().
Signed-Off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Stefan was hitting a panic in the async worker stuff because we had outstanding
read bios while we were stopping the worker threads. You could reproduce this
easily if you mount -o nospace_cache and ran generic/273. This is because the
caching thread stuff is still going and we were stopping all the worker threads.
We need to stop the workers after this work is done, and the free block groups
code will wait for all the caching threads to stop first so we don't run into
this problem. With this patch we no longer panic. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we abort a transaction we will do the tree log cleanup at unmount, but this
happens after we free up the block groups. This makes all the leak detection
warnings go off because we think we've leaked space but in reality we just
haven't cleaned it up yet. So instead do the block group cleanup stuff after
free'ing the fs roots so we don't get these warnings. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The transactions should be cleaning up their reservations on failure, this just
causes us to have warnings on unmount because we go negative by free'ing
reservations that have already been free'ed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently the hash value used for adding an inode to the VFS's inode
hash table consists of the plain inode number, which is a 64 bits
integer. This results in hash table buckets (hlist_head lists) with
too many elements for at least 2 important scenarios:
1) When we have many subvolumes. Each subvolume has its own btree
where its files and directories are added to, and each has its
own objectid (inode number) namespace. This means that if we have
N subvolumes, and all have inode number X associated to a file or
directory, the corresponding inodes all map to the same hash table
entry, resulting in a bucket (hlist_head list) with N elements;
2) On 32 bits machines. Th VFS hash values are unsigned longs, which
are 32 bits wide on 32 bits machines, and the inode (objectid)
numbers are 64 bits unsigned integers. We simply cast the inode
numbers to hash values, which means that for all inodes with the
same 32 bits lower half, the same hash bucket is used for all of
them. For example, all inodes with a number (objectid) between
0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in
the same hash table bucket.
This change ensures the inode's hash value depends both on the
objectid (inode number) and its subvolume's (btree root) objectid.
For 32 bits machines, this change gives better entropy by making
the hash value depend on both the upper and lower 32 bits of the
64 bits hash previously computed.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove unused parameter, 'eb'. Unused since introduction in
5f39d397df
Updated to be rebased against current upstream and correct diff supplied this time!
Signed-off-by: Ross Kirk <ross.kirk@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I was noticing the slab redzone stuff going off every once and a while during
transaction aborts. This was caused by two things
1) We would walk the pending snapshots and set their error to -ECANCELED. We
don't need to do this, the snapshot stuff waits for a transaction commit and if
there is a problem we just free our pending snapshot object and exit. Doing
this was causing us to touch the pending snapshot object after the thing had
already been freed.
2) We were freeing the transaction manually with wanton disregard for it's
use_count reference counter. To fix this I cleaned up the transaction freeing
loop to either wait for the transaction commit to finish if it was in the middle
of that (since it will be cleaned and freed up there) or to do the cleanup
oursevles.
I also moved the global "kill all things dirty everywhere" stuff outside of the
transaction cleanup loop since that only needs to be done once. With this patch
I'm no longer seeing slab corruption because of use after frees. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
During transaction cleanup after an abort we are just removing roots from the
ordered roots list which is incorrect. We have a BUG_ON() to make sure that the
root is still part of the ordered roots list when we put our ordered extent
which we were tripping in this case. So do like we do everywhere else and just
move it to the tail of the ordered roots list and allow the normal cleanup to
take care of stuff. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we abort not during a transaction commit we won't clean up anything until we
unmount. Unfortunately if we abort in the middle of writing out an ordered
extent we won't clean it up and if somebody is waiting on that ordered extent
they will wait forever. To fix this just make the transaction kthread call the
cleanup transaction stuff if it notices theres an error, and make
btrfs_end_transaction wake up the transaction kthread if there is an error.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
While looking at somebodys corruption I became completely convinced that
btrfs_split_item was broken, so I wrote this test to verify that it was working
as it was supposed to. Thankfully it appears to be working as intended, so just
add this test to make sure nobody breaks it in the future. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The fact that btrfs_root_refs() returned 0 for the tree_root caused
bugs in the past, therefore it is set to 1 with this patch and
(hopefully) all affected code is adapted to this change.
I verified this change by temporarily adding WARN_ON() checks
everywhere where btrfs_root_refs() is used, checking whether the
logic of the code is changed by btrfs_root_refs() returning 1
instead of 0 for root->root_key.objectid == BTRFS_ROOT_TREE_OBJECTID.
With these added checks, I ran the xfstests './check -g auto'.
The two roots chunk_root and log_root_tree that are only referenced
by the superblock and the log_roots below the log_root_tree still
have btrfs_root_refs() == 0, only the tree_root is changed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When doing space balance and subvolume destroy at the same time, we met
the following oops:
kernel BUG at fs/btrfs/relocation.c:2247!
RIP: 0010: [<ffffffffa04cec16>] prepare_to_merge+0x154/0x1f0 [btrfs]
Call Trace:
[<ffffffffa04b5ab7>] relocate_block_group+0x466/0x4e6 [btrfs]
[<ffffffffa04b5c7a>] btrfs_relocate_block_group+0x143/0x275 [btrfs]
[<ffffffffa0495c56>] btrfs_relocate_chunk.isra.27+0x5c/0x5a2 [btrfs]
[<ffffffffa0459871>] ? btrfs_item_key_to_cpu+0x15/0x31 [btrfs]
[<ffffffffa048b46a>] ? btrfs_get_token_64+0x7e/0xcd [btrfs]
[<ffffffffa04a3467>] ? btrfs_tree_read_unlock_blocking+0xb2/0xb7 [btrfs]
[<ffffffffa049907d>] btrfs_balance+0x9c7/0xb6f [btrfs]
[<ffffffffa049ef84>] btrfs_ioctl_balance+0x234/0x2ac [btrfs]
[<ffffffffa04a1e8e>] btrfs_ioctl+0xd87/0x1ef9 [btrfs]
[<ffffffff81122f53>] ? path_openat+0x234/0x4db
[<ffffffff813c3b78>] ? __do_page_fault+0x31d/0x391
[<ffffffff810f8ab6>] ? vma_link+0x74/0x94
[<ffffffff811250f5>] vfs_ioctl+0x1d/0x39
[<ffffffff811258c8>] do_vfs_ioctl+0x32d/0x3e2
[<ffffffff811259d4>] SyS_ioctl+0x57/0x83
[<ffffffff813c3bfa>] ? do_page_fault+0xe/0x10
[<ffffffff813c73c2>] system_call_fastpath+0x16/0x1b
It is because we returned the error number if the reference of the root was 0
when doing space relocation. It was not right here, because though the root
was dead(refs == 0), but the space it held still need be relocated, or we
could not remove the block group. So in this case, we should return the root
no matter it is dead or not.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The BUG() was replaced by btrfs_error() and return -EIO with the
patch "get rid of one BUG() in write_all_supers()", but the missing
mutex_unlock() was overlooked.
The 0-DAY kernel build service from Intel reported the missing
unlock which was found by the coccinelle tool:
fs/btrfs/disk-io.c:3422:2-8: preceding lock on line 3374
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We only need an async starter if we can't make a GFP_NOFS allocation in our
current path. This is the case for the endio stuff since it happens in IRQ
context, but things like the caching thread workers and the delalloc flushers we
can easily make this allocation and start threads right away. Also change the
worker count for the caching thread pool. Traditionally we limited this to 2
since we took read locks while caching, but nowadays we do this lockless so
there's no reason to limit the number of caching threads. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Mitch Harder noticed that the patch 3c64a1a mentioned in the subject
line was causing a kernel BUG() on snapshot deletion.
The patch was wrong. It did not handle cached roots correctly. The
check for root_refs == 0 was removed everywhere where
btrfs_read_fs_root_no_name() had been used to retrieve the root,
because this check was already dealt with in
btrfs_read_fs_root_no_name(). But in the case when the root was
found in the cache, there was no such check.
This patch adds the missing check in the case where the root is
found in the cache.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The second round uses btrfs_error() and return -EIO, the first round
can handle write errors the same way.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This change fixes an issue when removing a device and writing
all super blocks run simultaneously. Here's the steps necessary
for the issue to happen:
1) disk-io.c:write_all_supers() gets a number of N devices from the
super_copy, so it will not panic if it fails to write super blocks
for N - 1 devices;
2) Then it tries to acquire the device_list_mutex, but blocks because
volumes.c:btrfs_rm_device() got it first;
3) btrfs_rm_device() removes the device from the list, then unlocks the
mutex and after the unlock it updates the number of devices in
super_copy to N - 1.
4) write_all_supers() finally acquires the mutex, iterates over all the
devices in the list and gets N - 1 errors, that is, it failed to write
super blocks to all the devices;
5) Because write_all_supers() thinks there are a total of N devices, it
considers N - 1 errors to be ok, and therefore won't panic.
So this change just makes sure that write_all_supers() reads the number
of devices from super_copy after it acquires the device_list_mutex.
Conversely, it changes btrfs_rm_device() to update the number of devices
in super_copy before it releases the device list mutex.
The code path to add a new device (volumes.c:btrfs_init_new_device),
already has the right behaviour: it updates the number of devices in
super_copy while holding the device_list_mutex.
The only code path that doesn't lock the device list mutex
before updating the number of devices in the super copy is
disk-io.c:next_root_backup(), called by open_ctree() during
mount time where concurrency issues can't happen.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but
casts it to a pointer, while all callers cast it to unsigned long again.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Internally, btrfs_header_fsid() calculates an unsigned long, but casts
it to a pointer, while all callers cast it to unsigned long again.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
u64 is "unsigned long long" on all architectures now, so there's no need to
cast it when formatting it using the "ll" length modifier.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This should never be needed, but since all functions are there
to check and rebuild the UUID tree, a mount option is added that
allows to force this check and rebuild procedure.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the filesystem was mounted with an old kernel that was not
aware of the UUID tree, this is detected by looking at the
uuid_tree_generation field of the superblock (similar to how
the free space cache is doing it). If a mismatch is detected
at mount time, a thread is started that does two things:
1. Iterate through the UUID tree, check each entry, delete those
entries that are not valid anymore (i.e., the subvol does not
exist anymore or the value changed).
2. Iterate through the root tree, for each found subvolume, add
the UUID tree entries for the subvolume (if they are not
already there).
This mechanism is also used to handle and repair errors that
happened during the initial creation and filling of the tree.
The update of the uuid_tree_generation field (which indicates
that the state of the UUID tree is up to date) is blocked until
all create and repair operations are successfully completed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When the UUID tree is initially created, a task is spawned that
walks through the root tree. For each found subvolume root_item,
the uuid and received_uuid entries in the UUID tree are added.
This is such a quick operation so that in case somebody wants
to unmount the filesystem while the task is still running, the
unmount is delayed until the UUID tree building task is finished.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This tree is not created by mkfs.btrfs. Therefore when a filesystem
is mounted writable and the UUID tree does not exist, this tree is
created if required. The tree is also added to the fs_info structure
and initialized, but this commit does not yet read or write UUID tree
elements.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__
I tried to filter out the warnings for which patches have already
been sent to the mailing list, pending for inclusion in btrfs-next.
All these changes should be obviously safe.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I added a patch where we started taking the ordered operations mutex when we
waited on ordered extents. We need this because we splice the list and process
it, so if a flusher came in during this scenario it would think the list was
empty and we'd usually get an early ENOSPC. The problem with this is that this
lock is used in transaction committing. So we end up with something like this
Transaction commit
-> wait on writers
Delalloc flusher
-> run_ordered_operations (holds mutex)
->wait for filemap-flush to do its thing
flush task
-> cow_file_range
->wait on btrfs_join_transaction because we're commiting
some other task
-> commit_transaction because we notice trans->transaction->flush is set
-> run_ordered_operations (hang on mutex)
We need to disentangle the ordered operations flushing from the delalloc
flushing, since they are separate things. This solves the deadlock issue I was
seeing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I'ts hardcoded to 30 seconds which is fine for most users. Higher values
defer data being synced to permanent storage with obvious consequences
when the system crashes. The upper bound is not forced, but a warning is
printed if it's more than 300 seconds (5 minutes).
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Before applying this patch, we cached the csum value into the extent state
tree when reading some data from the disk, this operation increased the lock
contention of the state tree.
Now, we just store the csum value into the bio structure or other unshared
structure, so we can reduce the lock contention.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Some codes still use the cpu_to_lexx instead of the
BTRFS_SETGET_STACK_FUNCS declared in ctree.h.
Also added some BTRFS_SETGET_STACK_FUNCS for btrfs_header btrfs_timespec
and other structures.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs update from Chris Mason:
"These are the usual mixture of bugs, cleanups and performance fixes.
Miao has some really nice tuning of our crc code as well as our
transaction commits.
Josef is peeling off more and more problems related to early enospc,
and has a number of important bug fixes in here too"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (81 commits)
Btrfs: wait ordered range before doing direct io
Btrfs: only do the tree_mod_log_free_eb if this is our last ref
Btrfs: hold the tree mod lock in __tree_mod_log_rewind
Btrfs: make backref walking code handle skinny metadata
Btrfs: fix crash regarding to ulist_add_merge
Btrfs: fix several potential problems in copy_nocow_pages_for_inode
Btrfs: cleanup the code of copy_nocow_pages_for_inode()
Btrfs: fix oops when recovering the file data by scrub function
Btrfs: make the chunk allocator completely tree lockless
Btrfs: cleanup orphaned root orphan item
Btrfs: fix wrong mirror number tuning
Btrfs: cleanup redundant code in btrfs_submit_direct()
Btrfs: remove btrfs_sector_sum structure
Btrfs: check if we can nocow if we don't have data space
Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
Btrfs: use a percpu to keep track of possibly pinned bytes
Btrfs: check for actual acls rather than just xattrs when caching no acl
Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
Btrfs: optimize reada_for_balance
Btrfs: optimize read_block_for_search
...
category, of note is a fix for on-line resizing file systems where the
block size is smaller than the page size (i.e., file systems 1k blocks
on x86, or more interestingly file systems with 4k blocks on Power or
ia64 systems.)
In the cleanup category, the ext4's punch hole implementation was
significantly improved by Lukas Czerner, and now supports bigalloc
file systems. In addition, Jan Kara significantly cleaned up the
write submission code path. We also improved error checking and added
a few sanity checks.
In the optimizations category, two major optimizations deserve
mention. The first is that ext4_writepages() is now used for
nodelalloc and ext3 compatibility mode. This allows writes to be
submitted much more efficiently as a single bio request, instead of
being sent as individual 4k writes into the block layer (which then
relied on the elevator code to coalesce the requests in the block
queue). Secondly, the extent cache shrink mechanism, which was
introduce in 3.9, no longer has a scalability bottleneck caused by the
i_es_lru spinlock. Other optimizations include some changes to reduce
CPU usage and to avoid issuing empty commits unnecessarily.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
+TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
XrddRLGlk4Hf+2o7/D7B
=SwaI
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 update from Ted Ts'o:
"Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
block size is smaller than the page size (i.e., file systems 1k blocks
on x86, or more interestingly file systems with 4k blocks on Power or
ia64 systems.)
In the cleanup category, the ext4's punch hole implementation was
significantly improved by Lukas Czerner, and now supports bigalloc
file systems. In addition, Jan Kara significantly cleaned up the
write submission code path. We also improved error checking and added
a few sanity checks.
In the optimizations category, two major optimizations deserve
mention. The first is that ext4_writepages() is now used for
nodelalloc and ext3 compatibility mode. This allows writes to be
submitted much more efficiently as a single bio request, instead of
being sent as individual 4k writes into the block layer (which then
relied on the elevator code to coalesce the requests in the block
queue). Secondly, the extent cache shrink mechanism, which was
introduce in 3.9, no longer has a scalability bottleneck caused by the
i_es_lru spinlock. Other optimizations include some changes to reduce
CPU usage and to avoid issuing empty commits unnecessarily."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
ext4: optimize starting extent in ext4_ext_rm_leaf()
jbd2: invalidate handle if jbd2_journal_restart() fails
ext4: translate flag bits to strings in tracepoints
ext4: fix up error handling for mpage_map_and_submit_extent()
jbd2: fix theoretical race in jbd2__journal_restart
ext4: only zero partial blocks in ext4_zero_partial_blocks()
ext4: check error return from ext4_write_inline_data_end()
ext4: delete unnecessary C statements
ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
jbd2: move superblock checksum calculation to jbd2_write_superblock()
ext4: pass inode pointer instead of file pointer to punch hole
ext4: improve free space calculation for inline_data
ext4: reduce object size when !CONFIG_PRINTK
ext4: improve extent cache shrink mechanism to avoid to burn CPU time
ext4: implement error handling of ext4_mb_new_preallocation()
ext4: fix corruption when online resizing a fs with 1K block size
ext4: delete unused variables
ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
jbd2: remove debug dependency on debug_fs and update Kconfig help text
jbd2: use a single printk for jbd_debug()
...
When testing a corrupted fs I noticed I was getting sleep while atomic errors
when the transaction aborted. This is because btrfs_pin_extent may need to
allocate memory and we are calling this under the spin lock. Fix this by moving
it out and doing the pin after dropping the spin lock but before dropping the
mutex, the same way it works when delayed refs run normally. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restuctures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.
First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronizations problems. Its only element (the worker
struct) now lives within fs_info just as the rest of the rescan code.
Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
(A) rescan ioctl
(B) rescan resume by mount
(C) rescan by quota enable
Each case needs its own combination of the four following steps:
(1) set the progress [A, C: zero; B: state of umount]
(2) commit the transaction [A]
(3) set the counters [A, C: zero; B: state of umount]
(4) start worker [A, B, C]
qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.
We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.
As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We used 3 variants to track the state of the transaction, it was complex
and wasted the memory space. Besides that, it was hard to understand that
which types of the transaction handles should be blocked in each transaction
state, so the developers often made mistakes.
This patch improved the above problem. In this patch, we define 6 states
for the transaction,
enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
}
and just use 1 variant to track those state.
In order to make the blocked handle types for each state more clear,
we introduce a array:
unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
__TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
}
it is very intuitionistic.
Besides that, because we remove ->in_commit in transaction structure, so
the lock ->commit_lock which was used to protect it is unnecessary, remove
->commit_lock.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we umount a fs with serious errors, we will invoke btrfs_cleanup_transactions()
to clean up the residual transaction. At this time, It is impossible to start a new
transaction, so we needn't assign trans_no_join to 1, and also needn't clear running
transaction every time we destroy a residual transaction.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The reason we introduce per-subvolume ordered extent list is the same
as the per-subvolume delalloc inode list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we create a snapshot, we need flush all delalloc inodes in the
fs, just flushing the inodes in the source tree is OK. So we introduce
per-subvolume delalloc inode list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The grab/put funtions will be used in the next patch, which need grab
the root object and ensure it is not freed. We use reference counter
instead of the srcu lock is to aovid blocking the memory reclaim task,
which invokes synchronize_srcu().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are several functions whose code is similar, such as
btrfs_find_last_root()
btrfs_read_fs_root_no_radix()
Besides that, some functions are invoked twice, it is unnecessary,
for example, we are sure that all roots which is found in
btrfs_find_orphan_roots()
have their orphan items, so it is unnecessary to check the orphan
item again.
So cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The snapshot/subvolume deletion might spend lots of time, it would make
the remount task wait for a long time. This patch improve this problem,
we will break the deletion if the fs is remounted to be R/O. It will make
the users happy.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the fs is remounted to be R/O, it is unnecessary to call
btrfs_clean_one_deleted_snapshot(), so move the R/O check out of
this function. And besides that, it can make the check logic in the
caller more clear.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
In order to avoid the R/O remount, we acquired ->s_umount lock during
we deleted the dead snapshots and subvolumes. But it is unnecessary,
because we have cleaner_mutex.
We use cleaner_mutex to protect the process of the dead snapshots/subvolumes
deletion. And when we remount the fs to be R/O, we also acquire this mutex to
do cleanup after we change the status of the fs. That is this lock can serialize
the above operations, the cleaner can be aware of the status of the fs, and if
the cleaner is deleting the dead snapshots/subvolumes, the remount task will
wait for it. So it is safe to remove ->s_umount in cleaner_kthread().
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
No need to check for NULL in send.c and disk-io.c.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time
when we want to walk qgroup tree.
By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free()
once. This reduce some sys time to allocate memory, see the measurements below
fsstress -p 4 -n 10000 -d $dir
With this patch:
real 0m50.153s
user 0m0.081s
sys 0m6.294s
real 0m51.113s
user 0m0.092s
sys 0m6.220s
real 0m52.610s
user 0m0.096s
sys 0m6.125s avg 6.213
-----------------------------------------------------
Without the patch:
real 0m54.825s
user 0m0.061s
sys 0m10.665s
real 1m6.401s
user 0m0.089s
sys 0m11.218s
real 1m13.768s
user 0m0.087s
sys 0m10.665s avg 10.849
we can see the sys time reduce ~43%.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Code checked for raid 5 flag in two else-if branches, so code would never be reached. Probably a copy-paste bug.
Signed-off-by: Henrik Nordvik <henrikno@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave reported a panic because the extent_root->commit_root was NULL in the
caching kthread. That is because we just unset it in free_root_pointers, which
is not the correct thing to do, we have to either wait for the caching kthread
to complete or hold the extent_commit_sem lock so we know the thread has exited.
This patch makes the kthreads all stop first and then we do our cleanup. This
should fix the race. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We get a use after free if we had a transaction to cleanup since there could be
delayed inodes which refer to their respective fs_root. Thanks
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.
Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).
This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.
We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.
Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.
As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs. They are also used
to count IO failures on a per device basis.
Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.
This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index. The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
If we fail to load the chunk tree we'll call free_root_pointers, except we may
not have assigned the roots for the dev_root/extent_root/csum_root yet, so we
could NULL pointer deref at this point. Just add checks to make sure these
roots are set to keep us from panicing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_invalidate_inodes() may sleep, so we should not invoke it in the
spin lock context. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We have checked if ->node is NULL or not, so it is unnecessary to
use BUG_ON() to check again. Remove it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I'm sorry, theres no excuse for this sort of work. We need to use
root->leafsize since eb may be NULL. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Quota tree has been missing from lockdep annotations, though no warning
has been seen in the wild.
There's currently one entry that does not belong there,
BTRFS_ORPHAN_OBJECTID. No such tree exists, it's probably a copy &
paste mistake, the id is defined among tree ids.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We've added new checks to make sure the super block crc is correct
during mount. A fresh filesystem from an older mkfs won't have the
crc set. This adds a warning when it finds a newly created filesystem
but doesn't fail the mount.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The superblock checksum is not verified upon mount. <awkward silence>
Add that check and also reorder existing checks to a more logical
order.
Current mkfs.btrfs does not calculate the correct checksum of
super_block and thus a freshly created filesytem will fail to mount when
this patch is applied.
First transaction commit calculates correct superblock checksum and
saves it to disk.
Reproducer:
$ mfks.btrfs /dev/sda
$ mount /dev/sda /mnt
$ btrfs scrub start /mnt
$ sleep 5
$ btrfs scrub status /mnt
... super:2 ...
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.
removed functions:
btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()
btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.
ulist.c functions are left, another patch will take care of those.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If you try to mount -o loop a restored file system it will panic if the file
ends up being smaller than the original disk. This is because we go to try and
get a block for a super that may be past the EOF which makes __getblk return
NULL for a buffer head when we aren't expecting it to. Fix this by dealing with
this case and just jacking up the errors count. With this patch we no longer
panic when mounting a restored file system loopback. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If qgroup tracking is out of sync, a rescan operation can be started. It
iterates the complete extent tree and recalculates all qgroup tracking data.
This is an expensive operation and should not be used unless required.
A filesystem under rescan can still be umounted. The rescan continues on the
next mount. Status information is provided with a separate ioctl while a
rescan operation is in progress.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Sequence numbers for delayed refs have been introduced in the first version
of the qgroup patch set. To solve the problem of find_all_roots on a busy
file system, the tree mod log was introduced. The sequence numbers for that
were simply shared between those two users.
However, at one point in qgroup's quota accounting, there's a statement
accessing the previous sequence number, that's still just doing (seq - 1)
just as it would have to in the very first version.
To satisfy that requirement, this patch makes the sequence number counter 64
bit and splits it into a major part (used for qgroup sequence number
counting) and a minor part (incremented for each tree modification in the
log). This enables us to go exactly one major step backwards, as required
for qgroups, while still incrementing the sequence counter for tree mod log
insertions to keep track of their order. Keeping them in a single variable
means there's no need to change all the code dealing with comparisons of two
sequence numbers.
The sequence number is reset to 0 on commit (not new in this patch), which
ensures we won't overflow the two 32 bit counters.
Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
from the tree mod log code may happen.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It is a rare exception that a new tree is created, like the qgroups
tree. So far these new trees have an all-zero UUID in their root
items. All trees that mkfs.btrfs has created get an UUID during the
first mount when btrfs_read_root_item() rewrites the root_item to
the v2 structure style. These UUID are never used so far, but
anyway, since it is better to have it uniform for all trees, this
commit adds some lines that generate and write an UUID for newly
created trees.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I have a broken file system that when it aborts leaves all sorts of accounting
things wrong and gives you lots of WARN_ON()'s other than the abort. This is
because we're not cleaning up various parts of the file system when we abort.
The first chunks are specific to mount failures, we weren't cleaning up the
block group cached inodes and we weren't cleaning up any transactions that had
been aborted, which leaves a bunch of things laying around.
The second half of this are related to the cleanup parts. First we don't need
to release space for the dirty pages from the trans_block_rsv, that's all
handled by the trans handles so this is just plain wrong. The other thing is we
need to pin down extents that were set ->must_insert_reserved for delayed refs.
This isn't so much for the pinning but more for the cleaning up the
cache->reserved counter since we are no longer going to use those reserved
bytes. With this patch I no longer see a bunch of WARN_ON()'s when I try to
mount this broken file system, just the initial one from the abort. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can just look up the extent_buffers for the range and free stuff that way.
This makes the cleanup a bit cleaner and we can make sure to evict the
extent_buffers pretty quickly by marking them as stale. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can run the tree logging recovery or the orphan cleanup on mount, so we'll
end up looking up a random fs tree in the meantime. So we need to clean this up
so we don't leave extent buffers hanging around on the cache. With this patch
we no longer leak extent buffers on failure to mount. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We kept leaking extent buffers when mounting a broken file system and it turns
out it's because not everybody uses read_tree_block properly. You need to check
and make sure the extent_buffer is uptodate before you use it. This patch fixes
everybody who calls read_tree_block directly to make sure they check that it is
uptodate and free it and return an error if it is not. With this we no longer
leak EB's when things go horribly wrong. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With a users corrupted fs I was getting weird behavior and panics and it turns
out it was because one of his tree blocks had a bogus header level. So add this
to the sanity checks in the endio handler for tree blocks. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Martin Steigerwald reported a BUG_ON() where we were given a bogus bytenr to
map. Turns out he is using > PAGESIZE leafsizes. The readahead stuff is called
every time we do a completion, but we may not have finished reading in all the
pages, so the bytenr we read off the node could be completely bogus. Fix this
by only calling the readahead hook once all pages have been read in. Thanks,
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave reported a BUG_ON() that happened in end_page_writeback() after an abort.
This happened because we unconditionally call end_page_writeback() in the endio
case, which is right. However when we abort the transaction we will call
end_page_writeback() on any writeback pages we find, which is wrong. We need to
lock the page and wait on page writeback to complete if it is. There is nothing
unsafe about this since we are discarding the transaction anyway. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The following case will make the incompat/compat flag of the super block
be recovered.
Task1 |Task2
flags = btrfs_super_incompat_flags(); |
|flags = btrfs_super_incompat_flags();
flags |= new_flag1; |
|flags |= new_flag2;
btrfs_set_super_incompat_flags(flags); |
|btrfs_set_super_incompat_flags(flags);
the new_flag1 is recovered.
In order to avoid this problem, we introduce a lock named super_lock into
the btrfs_fs_info structure. If we want to update incompat/compat flags
of the super block, we must hold it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The original code has one spin_lock 'qgroup_lock' to protect quota
configurations in memory. If we want to add a BTRFS_QGROUP_INFO_KEY,
it will be added to Btree firstly, and then update configurations in
memory,however, a race condition may happen between these operations.
For example:
->add_qgroup_info_item()
->add_qgroup_rb()
For the above case, del_qgroup_info_item() may happen just before
add_qgroup_rb().
What's worse, when we want to add a qgroup relation:
->add_qgroup_relation_item()
->add_qgroup_relations()
We don't have any checks whether 'src' and 'dst' exist before
add_qgroup_relation_item(), a race condition can also happen for
the above case.
To avoid race condition and have all the necessary checks, we introduce
a mutex lock 'qgroup_ioctl_lock', and we make all the user change operations
protected by the mutex lock.
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery. I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem. The
problem is if your application does something like this
[prealloc][prealloc][prealloc]
the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents. So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space. If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later. The data and such will still work,
but everything else is broken. This patch fixes this by not allowing extents
that are on the modified list to be merged. This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree. So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents. With this patch the testcase I've created no
longer creates a bogus file system after replaying the log. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Each time pick one dead root from the list and let the caller know if
it's needed to continue. This should improve responsiveness during
umount and balance which at some point waits for cleaning all currently
queued dead roots.
A new dead root is added to the end of the list, so the snapshots
disappear in the order of deletion.
The snapshot cleaning work is now done only from the cleaner thread and the
others wake it if needed.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We currently store the first key of the tree block inside the reference for the
tree block in the extent tree. This takes up quite a bit of space. Make a new
key type for metadata which holds the level as the offset and completely removes
storing the btrfs_tree_block_info inside the extent ref. This reduces the size
from 51 bytes to 33 bytes per extent reference for each tree block. In practice
this results in a 30-35% decrease in the size of our extent tree, which means we
COW less and can keep more of the extent tree in memory which makes our heavy
metadata operations go much faster. This is not an automatic format change, you
must enable it at mkfs time or with btrfstune. This patch deals with having
metadata stored as either the old format or the new format so it is easy to
convert. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
free_root_pointers() has been introduced to cleanup all of tree roots,
so just use it instead.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We should free leaf and root before returning from the error
handling code.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Now that we use bit operation to check fs_state, update
btrfs_free_fs_root()'s checker, otherwise we get back to
memory leak case.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There are several bugs at error path of create_snapshot() when the
transaction commitment failed.
- access the freed transaction handler. At the end of the
transaction commitment, the transaction handler was freed, so we
should not access it after the transaction commitment.
- we were not aware of the error which happened during the snapshot
creation if we submitted a async transaction commitment.
- pending snapshot access vs pending snapshot free. when something
wrong happened after we submitted a async transaction commitment,
the transaction committer would cleanup the pending snapshots and
free them. But the snapshot creators were not aware of it, they
would access the freed pending snapshots.
This patch fixes the above problems by:
- remove the dangerous code that accessed the freed handler
- assign ->error if the error happens during the snapshot creation
- the transaction committer doesn't free the pending snapshots,
just assigns the error number and evicts them before we unblock
the transaction.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The stripe hash table is large, starting with allocation order 4 and can go as
high as order 7 in case lock debugging is turned on and structure padding
happens.
Observed mount failure:
mount: page allocation failure: order:7, mode:0x200050
Pid: 8234, comm: mount Tainted: G W 3.8.0-default+ #267
Call Trace:
[<ffffffff81114353>] warn_alloc_failed+0xf3/0x140
[<ffffffff811171d2>] ? __alloc_pages_direct_compact+0x92/0x250
[<ffffffff81117ac3>] __alloc_pages_nodemask+0x733/0x9d0
[<ffffffff81152878>] ? cache_alloc_refill+0x3f8/0x840
[<ffffffff811528bc>] cache_alloc_refill+0x43c/0x840
[<ffffffff811302eb>] ? is_kernel_percpu_address+0x4b/0x90
[<ffffffffa00a00ac>] ? btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
[<ffffffff811531d7>] kmem_cache_alloc_trace+0x247/0x270
[<ffffffffa00a00ac>] btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs]
[<ffffffffa003133f>] open_ctree+0xb2f/0x1f90 [btrfs]
[<ffffffff81397289>] ? string+0x49/0xe0
[<ffffffff813987b3>] ? vsnprintf+0x443/0x5d0
[<ffffffffa0007cb6>] btrfs_mount+0x526/0x600 [btrfs]
[<ffffffff8115127c>] ? cache_alloc_debugcheck_after+0x4c/0x200
[<ffffffff81162b90>] mount_fs+0x20/0xe0
[<ffffffff8117db26>] vfs_kern_mount+0x76/0x120
[<ffffffff811801b6>] do_mount+0x386/0x980
[<ffffffff8112a5cb>] ? strndup_user+0x5b/0x80
[<ffffffff81180840>] sys_mount+0x90/0xe0
[<ffffffff81962e99>] system_call_fastpath+0x16/0x1b
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we abort a transaction while fsyncing, we'll skip freeing log roots
part of committing a transaction, which leads to memory leak.
This adds a 'free log roots' in putting super when no more users hold
references on log roots, so it's safe and clean.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
super.magic is an le64 but it's treated as an unterminated string when
compared against BTRFS_MAGIC which is defined as a string. Instead
define BTRFS_MAGIC as a normal hex value and use endian helpers to
compare it to the super's magic.
I tested this by mounting an fs made before the change and made sure
that it didn't introduce sparse errors. This matches a similar cleanup
that is pending in btrfs-progs. David Sterba pointed out that we should
fix the kernel side as well :).
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Miao made the ordered operations stuff run async, which introduced a
deadlock where we could get somebody (sync) racing in and committing the
transaction while a commit was already happening. The new committer would
try and flush ordered operations which would hang waiting for the commit to
finish because it is done asynchronously and no longer inherits the callers
trans handle. To fix this we need to make the ordered operations list a per
transaction list. We can get new inodes added to the ordered operation list
by truncating them and then having another process writing to them, so this
makes it so that anybody trying to add an ordered operation _must_ start a
transaction in order to add itself to the list, which will keep new inodes
from getting added to the ordered operations list after we start committing.
This should fix the deadlock and also keeps us from doing a lot more work
than we need to during commit. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
open_ctree() need read the metadata to initialize the global information
of btrfs. But it may fail after it submit some bio, and then it will jump
to the error path. Unfortunately, it doesn't check if there are some bios
in flight, and just stop all the worker threads. As a result, when the
submitted bios end, they can not find any worker thread which can deal with
subsequent work, then oops happen.
kernel BUG at fs/btrfs/async-thread.c:605!
Fix this problem by invoking invalidate_inode_pages2() before we stop the
worker threads. This function will wait until the bio end because it need
lock the pages which are going to be invalidated, and if a page is under
disk read IO, it must be locked. invalidate_inode_pages2() need wait until
end bio handler to unlocked it.
Reported-and-Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we abort we've been just free'ing up all the ordered extents and
hoping for the best. This results in lots of warnings from various places,
warnings from btrfs_destroy_inode() because it's ENOSPC accounting isn't
fixed. It will also screw up lots of pages who have been set private but
never get cleared because the ordered extents are never allowed to be
submitted. This patch fixes those warnings. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I hit this error when reproducing a bug that would end in a transaction
abort. We take the delayed ref head's mutex to keep anybody from processing
it while we're destroying it, but we fail to drop the mutex before we carry
on and free the damned thing. Fix this by doing the remove logic for the
head ourselves and unlock the mutex, that way we can avoid use after free's
or hung tasks waiting on that mutex to come back so they know the delayed
ref completed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
No need to test the result, we can't get a
null pointer from list_entry()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is no lock to protect fs_info->fs_state, it will introduce
some problems, such as the value may be covered by the other task
when several tasks modify it. For example:
Task0 - CPU0 Task1 - CPU1
mov %fs_state rax
or $0x1 rax
mov %fs_state rax
or $0x2 rax
mov rax %fs_state
mov rax %fs_state
The expected value is 3, but in fact, it is 2.
Though this problem doesn't happen now (because there is only one
flag currently), the code is error prone, if we add other flags,
the above problem will happen to a certainty.
Now we use bit operation for it to fix the above problem.
In this way, we can make the code more robust and be easy to
add new flags.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is no lock to protect
fs_info->avail_{data, metadata, system}_alloc_bits,
it may introduce some problem, such as the wrong profile
information, so we add a seqlock to protect them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need not use a global lock to protect the delalloc_bytes of the
inode, just use its own lock. In this way, we can reduce the lock
contention and ->delalloc_lock will just protect delalloc inode
list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fs_info->delalloc_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant for it to reduce the lock
contention.
This patch also fixed the problem that we access the variant
without the lock protection.At worst, we would not flush the
delalloc inodes, and just return ENOSPC error when we still have
some free space in the fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
->dirty_metadata_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant to reduce the contention of
the lock.
This patch also fixed the problem that we access it without
lock protection in __btrfs_btree_balance_dirty(), which may
cause we skip the dirty pages flush.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This'd save us a rbtree search which may become expensive in large filesystem.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we don't actually copy the extent information from the source tree in
the fast case we don't need to wait for ordered io to be completed in order
to fsync, we just need to wait for the io to be completed. So when we're
logging our file just attach all of the ordered extents to the log, and then
when the log syncs just wait for IO_DONE on the ordered extents and then
write the super. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Use wrapper page_offset to get byte-offset into filesystem object for page.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The delayed reference allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.
And besides that, it can do check for leaked objects when the module is removed.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.
Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.
But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.
Scrub is unable to repair crc errors on raid5/6 chunks.
Discard does not work on raid5/6 (yet)
The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs update from Chris Mason:
"A big set of fixes and features.
In terms of line count, most of the code comes from Stefan, who added
the ability to replace a single drive in place. This is different
from how btrfs normally replaces drives, and is much much much faster.
Josef is plowing through our synchronous write performance. This pull
request does not include the DIO_OWN_WAITING patch that was discussed
on the list, but it has a number of other improvements to cut down our
latencies and CPU time during fsync/O_DIRECT writes.
Miao Xie has a big series of fixes and is spreading out ordered
operations over more CPUs. This improves performance and reduces
contention.
I've put in fixes for error handling around hash collisions. These
are going back to individual stable kernels as I test against them.
Otherwise we have a lot of fixes and cleanups, thanks everyone!
raid5/6 is being rebased against the device replacement code. I'll
have it posted this Friday along with a nice series of benchmarks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
Btrfs: fix a bug of per-file nocow
Btrfs: fix hash overflow handling
Btrfs: don't take inode delalloc mutex if we're a free space inode
Btrfs: fix autodefrag and umount lockup
Btrfs: fix permissions of empty files not affected by umask
Btrfs: put raid properties into global table
Btrfs: fix BUG() in scrub when first superblock reading gives EIO
Btrfs: do not call file_update_time in aio_write
Btrfs: only unlock and relock if we have to
Btrfs: use tokens where we can in the tree log
Btrfs: optimize leaf_space_used
Btrfs: don't memset new tokens
Btrfs: only clear dirty on the buffer if it is marked as dirty
Btrfs: move checks in set_page_dirty under DEBUG
Btrfs: log changed inodes based on the extent map tree
Btrfs: add path->really_keep_locks
Btrfs: do not mark ems as prealloc if we are writing to them
Btrfs: keep track of the extents original block length
Btrfs: inline csums if we're fsyncing
Btrfs: don't bother copying if we're only logging the inode
...
No reason to set the path blocking or loop through all of the pages if the
extent buffer isn't actually marked dirty. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is a high traffic function, let's try and do as little as possible
during normal operations shall we?
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Originally root_times_lock was introduced as part of send/receive
code however newly developed patch to label the subvol reused
the same lock, so renaming it for a meaningful name.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch restructure btrfs_run_defrag_inodes() and make the code of the auto
defragment more readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This commit contains all the essential changes to the core code
of Btrfs for support of the device replace procedure.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the addition of the device replace procedure, it is possible
for btrfs_map_bio(READ) to report an error. This happens when the
specific mirror is requested which is located on the target disk,
and the copy operation has not yet copied this block. Hence the
block cannot be read and this error state is indicated by
returning EIO.
Some background information follows now. A new mirror is added
while the device replace procedure is running.
btrfs_get_num_copies() returns one more, and
btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk
location is involved that was already handled by the device
replace copy operation. The assigned mirror num is the highest
mirror number, e.g. the value 3 in case of RAID1.
If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select
any mirror), the copy on the target drive is never selected
because that disk shall be able to perform the write requests as
quickly as possible. The parallel execution of read requests would
only slow down the disk copy procedure. Second case is that
btrfs_map_bio() is called with mirror_num > 0. This is done from
the repair code only. In this case, the highest mirror num is
assigned to the target disk, since it is used last. And when this
mirror is not available because the copy procedure has not yet
handled this area, an error is returned. Everywhere in the code
the handling of such errors is added now.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A small number of functions that are used in a device replace
procedure when the operation is resumed at mount time are unable
to pass the same root pointer that would be used in the regular
(ioctl) context. And since the root pointer is not required, only
the fs_info is, the root pointer argument is replaced with the
fs_info pointer argument.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is required for the device replace procedure in a later step.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
- 'nr' is no more used.
- btrfs_btree_balance_dirty() and __btrfs_btree_balance_dirty() can share
a bunch of code.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use WARN rather than printk followed by WARN_ON(1), for conciseness.
A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression list es;
@@
-printk(
+WARN(1,
es);
-WARN_ON(1);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The current behavior is to allow mounting or remounting a filesystem
writeable in degraded mode if at least one writeable device is
present.
The next failed write access to a missing device which is above
the tolerance of the configured level of redundancy results in an
read-only enforcement. Even without this, the next time
barrier_all_devices() is called and more devices are missing than
tolerable, the switch to read-only mode takes place.
In order to behave predictably and to provide proper feedback to
the user at mount time, this patch compares the number of missing
devices with the number of devices that are tolerated to be missing
according to the configured RAID level. If more devices are missing
than tolerated, e.g. if two devices are missing in case of RAID1,
only a read-only mount and remount is allowed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There is no reason to pass the nr_pages_dirtied argument, because
nr_pages_dirtied value from the caller is unused in
balance_dirty_pages_ratelimited_nr().
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduce a new worker pool named "flush_workers", and if we
want to force all the inode with pending delalloc to the disks, we can
queue those inodes into the work queue of the worker pool, in this way,
those inodes will be flushed by multi-task.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In csum_dirty_buffer, we first get eb from page->private.
Then we check if the page is the first page of eb. Later
we check it again. Remove the repeated check here.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
So far the return code of barrier_all_devices() is ignored, which
means that errors are ignored. The result can be a corrupt
filesystem which is not consistent.
This commit adds code to evaluate the return code of
barrier_all_devices(). The normal btrfs_error() mechanism is used to
switch the filesystem into read-only mode when errors are detected.
In order to decide whether barrier_all_devices() should return
error or success, the number of disks that are allowed to fail the
barrier submission is calculated. This calculation accounts for the
worst RAID level of metadata, system and data. If single, dup or
RAID0 is in use, a single disk error is already considered to be
fatal. Otherwise a single disk error is tolerated.
The calculation of the number of disks that are tolerated to fail
the barrier operation is performed when the filesystem gets mounted,
when a balance operation is started and finished, and when devices
are added or removed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Everytime we write out dirty pages we search for an offset in the tree,
convert the bits in the state, and then when we wait we search for the
offset again and clear the bits. So for every dirty range in the io tree we
are doing 4 rb searches, which is suboptimal. With this patch we are only
doing 2 searches for every cycle (modulo weird things happening). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are a coule scenarios where farming metadata csumming off to an async
thread doesn't help. The first is if our processor supports crc32c, in
which case the csumming will be fast and so the overhead of the async model
is not worth the cost. The other case is for our tree log. We will be
making that stuff dirty and writing it out and waiting for it immediately.
Even with software crc32c this gives me a ~15% increase in speed with O_SYNC
workloads. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With the following debug patch:
static int btrfs_freeze(struct super_block *sb)
{
+ struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+ struct btrfs_transaction *trans;
+
+ spin_lock(&fs_info->trans_lock);
+ trans = fs_info->running_transaction;
+ if (trans) {
+ printk("Transid %llu, use_count %d, num_writer %d\n",
+ trans->transid, atomic_read(&trans->use_count),
+ atomic_read(&trans->num_writers));
+ }
+ spin_unlock(&fs_info->trans_lock);
return 0;
}
I found there was a orphan transaction after the freeze operation was done.
It is because the transaction may not be committed when the transaction handle
end even though it is the last handle of the current transaction. This design
avoid committing the transaction frequently, but also introduce the above
problem.
So I add btrfs_attach_transaction() which can catch the current transaction
and commit it. If there is no transaction, it will return ENOENT, and do not
anything.
This function also can be used to instead of btrfs_join_transaction_freeze()
because it don't increase the writer counter and don't start a new transaction,
so it also can fix the deadlock between sync and freeze.
Besides that, it is used to instead of btrfs_join_transaction() in
transaction_kthread(), because if there is no transaction, the transaction
kthread needn't anything.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The btree inode has it's own write cache pages so we can remove this write
cache pages hook as it's not used. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Unnecessary lookup_extent_mapping() is removed because an error is
returned to the caller.
This patch was made based on the advice from Stefan Behrens, thanks.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
As ref cache has been removed from btrfs, there is no user on
its lock and its check.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
We forget to protect ->log_batch when syncing a file, this patch fix
this problem by atomic operation. And ->log_batch is used to check
if there are parallel sync operations or not, so it is unnecessary to
reset it to 0 after the sync operation of the current log tree complete.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Sometimes we need choose the method of the reservation according to the type
of the block reservation, such as the reservation for the delayed inode update.
Now we identify the type just by comparing the address of the reservation
variants, it is very ugly if it is a temporary one because we need compare it
with all the common reservation variants. So we add a new "type" field to keep
the type the reservation variants.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Pull btrfs fixes from Chris Mason:
"I've split out the big send/receive update from my last pull request
and now have just the fixes in my for-linus branch. The send/recv
branch will wander over to linux-next shortly though.
The largest patches in this pull are Josef's patches to fix DIO
locking problems and his patch to fix a crash during balance. They
are both well tested.
The rest are smaller fixes that we've had queued. The last rc came
out while I was hacking new and exciting ways to recover from a
misplaced rm -rf on my dev box, so these missed rc3."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: fix that repair code is spuriously executed for transid failures
Btrfs: fix ordered extent leak when failing to start a transaction
Btrfs: fix a dio write regression
Btrfs: fix deadlock with freeze and sync V2
Btrfs: revert checksum error statistic which can cause a BUG()
Btrfs: remove superblock writing after fatal error
Btrfs: allow delayed refs to be merged
Btrfs: fix enospc problems when deleting a subvol
Btrfs: fix wrong mtime and ctime when creating snapshots
Btrfs: fix race in run_clustered_refs
Btrfs: don't run __tree_mod_log_free_eb on leaves
Btrfs: increase the size of the free space cache
Btrfs: barrier before waitqueue_active
Btrfs: fix deadlock in wait_for_more_refs
btrfs: fix second lock in btrfs_delete_delayed_items()
Btrfs: don't allocate a seperate csums array for direct reads
Btrfs: do not strdup non existent strings
Btrfs: do not use missing devices when showing devname
Btrfs: fix that error value is changed by mistake
Btrfs: lock extents as we map them in DIO
...
If verify_parent_transid() fails for all mirrors, the current code
calls repair_io_failure() anyway which means:
- that the disk block is rewritten without repairing anything and
- that a kernel log message is printed which misleadingly claims
that a read error was corrected.
This is an example:
parent transid verify failed on 615015833600 wanted 110423 found 110424
parent transid verify failed on 615015833600 wanted 110423 found 110424
btrfs read error corrected: ino 1 off 615015833600 (dev /dev/...)
It is wrong to ignore the results from verify_parent_transid() and to
call repair_eb_io_failure() when the verification of the transids failed.
This commit fixes the issue.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With commit acce952b0, btrfs was changed to flag the filesystem with
BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
error happened like a write I/O errors of all mirrors.
In such situations, on unmount, the superblock is written in
btrfs_error_commit_super(). This is done with the intention to be able
to evaluate the error flag on the next mount. A warning is printed
in this case during the next mount and the log tree is ignored.
The issue is that it is possible that the superblock points to a root
that was not written (due to write I/O errors).
The result is that the filesystem cannot be mounted. btrfsck also does
not start and all the other btrfs-progs tools fail to start as well.
However, mount -o recovery is working well and does the right things
to recover the filesystem (i.e., don't use the log root, clear the
free space cache and use the next mountable root that is stored in the
root backup array).
This patch removes the writing of the superblock when
BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
flag in the mount function.
These lines can be used to reproduce the issue (using /dev/sdm):
SCRATCH_DEV=/dev/sdm
SCRATCH_MNT=/mnt
echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup create foo
ls -alLF /dev/mapper/foo
mkfs.btrfs /dev/mapper/foo
mount /dev/mapper/foo $SCRATCH_MNT
echo bar > $SCRATCH_MNT/foo
sync
echo 0 25165824 error | dmsetup reload foo
dmsetup resume foo
ls -alF $SCRATCH_MNT
touch $SCRATCH_MNT/1
ls -alF $SCRATCH_MNT
sleep 35
echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup reload foo
dmsetup resume foo
sleep 1
umount $SCRATCH_MNT
btrfsck /dev/mapper/foo
dmsetup remove foo
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
We need a barrir before calling waitqueue_active otherwise we will miss
wakeups. So in places that do atomic_dec(); then atomic_read() use
atomic_dec_return() which imply a memory barrier (see memory-barriers.txt)
and then add an explicit memory barrier everywhere else that need them.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit a168650c introduced a waiting mechanism to prevent busy waiting in
btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
where a tree_mod_seq is held while waiting for the io to complete, while
the end_io calls btrfs_run_delayed_refs.
This whole mechanism is unnecessary. If not enough runnable refs are
available to satisfy count, just return as count is more like a guideline
than a strict requirement.
In case we have to run all refs, commit transaction makes sure that no
other threads are working in the transaction anymore, so we just assert
here that no refs are blocked.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull second vfs pile from Al Viro:
"The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
deadlock reproduced by xfstests 068), symlink and hardlink restriction
patches, plus assorted cleanups and fixes.
Note that another fsfreeze deadlock (emergency thaw one) is *not*
dealt with - the series by Fernando conflicts a lot with Jan's, breaks
userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
for massive vfsmount leak; this is going to be handled next cycle.
There probably will be another pull request, but that stuff won't be
in it."
Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
delousing target_core_file a bit
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
fs: Remove old freezing mechanism
ext2: Implement freezing
btrfs: Convert to new freezing mechanism
nilfs2: Convert to new freezing mechanism
ntfs: Convert to new freezing mechanism
fuse: Convert to new freezing mechanism
gfs2: Convert to new freezing mechanism
ocfs2: Convert to new freezing mechanism
xfs: Convert to new freezing code
ext4: Convert to new freezing mechanism
fs: Protect write paths by sb_start_write - sb_end_write
fs: Skip atime update on frozen filesystem
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
fs: Improve filesystem freezing handling
switch the protection of percpu_counter list to spinlock
nfsd: Push mnt_want_write() outside of i_mutex
btrfs: Push mnt_want_write() outside of i_mutex
fat: Push mnt_want_write() outside of i_mutex
...
We convert btrfs_file_aio_write() to use new freeze check. We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.
Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).
CC: linux-btrfs@vger.kernel.org
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use the generic printk_get_level() to search a message for a kern_level.
Add __printf to verify format and arguments. Fix a few messages that
had mismatches in format and arguments. Add #ifdef CONFIG_PRINTK blocks
to shrink the object size a bit when not using printk.
[akpm@linux-foundation.org: whitespace tweak]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the kernel portion of btrfs send/receive
Conflicts:
fs/btrfs/Makefile
fs/btrfs/backref.h
fs/btrfs/ctree.c
fs/btrfs/ioctl.c
fs/btrfs/ioctl.h
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch introduces uuids for subvolumes. Each
subvolume has it's own uuid. In case it was snapshotted,
it also contains parent_uuid. In case it was received,
it also contains received_uuid.
It also introduces subvolume ctime/otime/stime/rtime. The
first two are comparable to the times found in inodes. otime
is the origin/creation time and ctime is the change time.
stime/rtime are only valid on received subvolumes.
stime is the time of the subvolume when it was
sent. rtime is the time of the subvolume when it was
received.
Additionally to the times, we have a transid for each
time. They are updated at the same place as the times.
btrfs receive uses stransid and rtransid to find out
if a received subvolume changed in the meantime.
If an older kernel mounts a filesystem with the
extented fields, all fields become invalid. The next
mount with a new kernel will detect this and reset the
fields.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
From btree_read_extent_buffer_pages(), currently repair_io_failure()
can be called with mirror_num being zero when submit_one_bio() returned
an error before. This used to cause a BUG_ON(!mirror_num) in
repair_io_failure() and indeed this is not a case that needs the I/O
repair code to rewrite disk blocks.
This commit prevents calling repair_io_failure() in this case and thus
avoids the BUG_ON() and malfunction.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There used to be a BUG_ON(ret) there before EH patch (79787eaa) went in.
Bail out with EINVAL.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Init the quota tree along with the others on open_ctree
and close_ctree. Add the quota tree to the list of well
known trees in btrfs_read_fs_root_no_name.
Signed-off-by: Arne Jansen <sensille@gmx.net>
We've got two mechanisms both required for reliable backref resolving (tree
mod log and holding back delayed refs). You cannot make use of one without
the other. So instead of requiring the user of this mechanism to setup both
correctly, we join them into a single interface.
Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list
as we did before, which was of no value.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Pull btrfs updates from Chris Mason:
"I held off on my rc5 pull because I hit an oops during log recovery
after a crash. I wanted to make sure it wasn't a regression because
we have some logging fixes in here.
It turns out that a commit during the merge window just made it much
more likely to trigger directory logging instead of full commits,
which exposed an old bug.
The new backref walking code got some additional fixes. This should
be the final set of them.
Josef fixed up a corner where our O_DIRECT writes and buffered reads
could expose old file contents (not stale, just not the most recent).
He and Liu Bo fixed crashes during tree log recover as well.
Ilya fixed errors while we resume disk balancing operations on
readonly mounts."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: run delayed directory updates during log replay
Btrfs: hold a ref on the inode during writepages
Btrfs: fix tree log remove space corner case
Btrfs: fix wrong check during log recovery
Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS
Btrfs: resume balance on rw (re)mounts properly
Btrfs: restore restriper state on all mounts
Btrfs: fix dio write vs buffered read race
Btrfs: don't count I/O statistic read errors for missing devices
Btrfs: resolve tree mod log locking issue in btrfs_next_leaf
Btrfs: fix tree mod log rewind of ADD operations
Btrfs: leave critical region in btrfs_find_all_roots as soon as possible
Btrfs: always put insert_ptr modifications into the tree mod log
Btrfs: fix tree mod log for root replacements at leaf level
Btrfs: support root level changes in __resolve_indirect_ref
Btrfs: avoid waiting for delayed refs when we must not
This introduces btrfs_resume_balance_async(), which, given that
restriper state was recovered earlier by btrfs_recover_balance(),
resumes balance in btrfs-balance kthread.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix a bug that triggered asserts in btrfs_balance() in both normal and
resume modes -- restriper state was not properly restored on read-only
mounts. This factors out resuming code from btrfs_restore_balance(),
which is now also called earlier in the mount sequence to avoid the
problem of some early writes getting the old profile.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull btrfs fixes from Chris Mason:
"This is a small pull with btrfs fixes. The biggest of the bunch is
another fix for the new backref walking code.
We're still hammering out one btrfs dio vs buffered reads problem, but
that one will have to wait for the next rc."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: delay iput with async extents
Btrfs: add a missing spin_lock
Btrfs: don't assume to be on the correct extent in add_all_parents
Btrfs: introduce btrfs_next_old_item
When fixing up the locking in the delayed ref destruction work I accidently
broke the locking myself ;(. Add back a spin_lock that should be there and
we are now all set. Thanks,
Btrfs: add a missing spin_lock
When fixing up the locking in the delayed ref destruction work I accidently
broke the locking myself ;(. Add back a spin_lock that should be there and
we are now all set. Thanks,
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs update from Chris Mason:
"The dates look like I had to rebase this morning because there was a
compiler warning for a printk arg that I had missed earlier.
These are all fixes, including one to prevent using stale pointers for
device names, and lots of fixes around transaction abort cleanups
(Josef, Liu Bo).
Jan Schmidt also sent in a number of fixes for the new reference
number tracking code.
Liu Bo beat me to updating the MAINTAINERS file. Since he thought to
also fix the git url, I kept his commit."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
Btrfs: update MAINTAINERS info for BTRFS FILE SYSTEM
Btrfs: destroy the items of the delayed inodes in error handling routine
Btrfs: make sure that we've made everything in pinned tree clean
Btrfs: avoid memory leak of extent state in error handling routine
Btrfs: do not resize a seeding device
Btrfs: fix missing inherited flag in rename
Btrfs: fix incompat flags setting
Btrfs: fix defrag regression
Btrfs: call filemap_fdatawrite twice for compression
Btrfs: keep inode pinned when compressing writes
Btrfs: implement ->show_devname
Btrfs: use rcu to protect device->name
Btrfs: unlock everything properly in the error case for nocow
Btrfs: fix btrfs_destroy_marked_extents
Btrfs: abort the transaction if the commit fails
Btrfs: wake up transaction waiters when aborting a transaction
Btrfs: fix locking in btrfs_destroy_delayed_refs
Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error
Btrfs: fix race in tree mod log addition
Btrfs: add btrfs_next_old_leaf
...
the items of the delayed inodes were forgotten to be freed, this patch
fixes it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Since we have two trees for recording pinned extents, we need to go through
both of them to make sure that we've done everything clean.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We've forgotten to clear extent states in pinned tree, which will results in
space counter mismatch and memory leak:
WARNING: at fs/btrfs/extent-tree.c:7537 btrfs_free_block_groups+0x1f3/0x2e0 [btrfs]()
...
space_info 2 has 8380416 free, is not full
space_info total=12582912, used=4096, pinned=4096, reserved=0, may_use=0, readonly=4194304
btrfs state leak: start 29364224 end 29376511 state 1 in tree ffff880075f20090 refs 1
...
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Al pointed out that we can just toss out the old name on a device and add a
new one arbitrarily, so anybody who uses device->name in printk could
possibly use free'd memory. Instead of adding locking around all of this he
suggested doing it with RCU, so I've introduced a struct rcu_string that
does just that and have gone through and protected all accesses to
device->name that aren't under the uuid_mutex with rcu_read_lock(). This
protects us and I will use it for dealing with removing the device that we
used to mount the file system in a later patch. Thanks,
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
So we're forcing the eb's to have their ref count set to 1 so invalidatepage
works but this breaks lots of things, for example root nodes, and is just
plain wrong, we don't need to just evict all of this stuff. Also drop the
invalidatepage altogether and add a page_cache_release(). With this patch
we no longer hang when trying to access the root nodes after an aborted
transaction and we no longer leak memory. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I was getting lots of hung tasks and a NULL pointer dereference because we
are not cleaning up the transaction properly when it aborts. First we need
to reset the running_transaction to NULL so we don't get a bad dereference
for any start_transaction callers after this. Also we cannot rely on
waitqueue_active() since it's just a list_empty(), so just call wake_up()
directly since that will do the barrier for us and such. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The transaction abort stuff was throwing warnings from the list debugging
code because we do a list_del_init outside of the delayed_refs spin lock.
The delayed refs locking makes baby Jesus cry so it's not hard to get wrong,
but we need to take the ref head mutex to make sure it's not being processed
currently, and so if it is we need to drop the spin lock and then take and
drop the mutex and do the search again. If we can take the mutex then we
can safely remove the head from the list and carry on. Now when the
transaction aborts I don't get the list debugging warnings. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Pull btrfs updates from Chris Mason:
"This includes a fairly large change from Josef around data writeback
completion. Before, the writeback wasn't completed until the metadata
insertions for the extent were done, and this made for fairly large
latency spikes on the last page of each ordered extent.
We already had a separate mechanism for tracking pending metadata
insertions, so Josef just needed to tweak things a little to end
writeback earlier on the page. Overall it makes us much friendly to
memory reclaim and lowers latencies quite a lot for synchronous IO.
Jan Schmidt has finished some background work required to track btree
blocks as they go through changes in ownership. It's the missing
piece he needed for both btrfs send/receive and subvolume quotas.
Neither of those are ready yet, but the new tracking code is included
here. Most of the time, the new code is off. It is only used by
scrub and other backref walkers.
Stefan Behrens has added io failure tracking. This includes counters
for which drives are causing the most trouble so the admin (or an
automated tool) can choose to kick them out. We're tracking IO
errors, crc errors, and generation checks we do on each metadata
block.
RAID5/6 did miss the cut this time because I'm having trouble with
corruptions. I'll nail it down next week and post as a beta testing
before 3.6"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (58 commits)
Btrfs: fix tree mod log rewinded level and rewinding of moved keys
Btrfs: fix tree mod log del_ptr
Btrfs: add tree_mod_dont_log helper
Btrfs: add missing spin_lock for insertion into tree mod log
Btrfs: add inodes before dropping the extent lock in find_all_leafs
Btrfs: use delayed ref sequence numbers for all fs-tree updates
Btrfs: fix false positive in check-integrity on unmount
Btrfs: fix runtime warning in check-integrity check data mode
Btrfs: set ioprio of scrub readahead to idle
Btrfs: fix return code in drop_objectid_items
Btrfs: check to see if the inode is in the log before fsyncing
Btrfs: return value of btrfs_read_buffer is checked correctly
Btrfs: read device stats on mount, write modified ones during commit
Btrfs: add ioctl to get and reset the device stats
Btrfs: add device counters for detected IO and checksum errors
btrfs: Drop unused function btrfs_abort_devices()
Btrfs: fix the same inode id problem when doing auto defragment
Btrfs: fall back to non-inline if we don't have enough space
Btrfs: fix how we deal with the orphan block rsv
Btrfs: convert the inode bit field to use the actual bit operations
...
The device statistics are written into the device tree with each
transaction commit. Only modified statistics are written.
When a filesystem is mounted, the device statistics for each involved
device are read from the device tree and used to initialize the
counters.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
The goal is to detect when drives start to get an increased error rate,
when drives should be replaced soon. Therefore statistic counters are
added that count IO errors (read, write and flush). Additionally, the
software detected errors like checksum errors and corrupted blocks are
counted.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
1) This function is not used anywhere.
2) Using the blk_abort_queue() to abort the queue seems not correct.
blk_abort_queue() is used for timeout handling (block/blk-timeout.c).
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Asias He <asias@redhat.com>
Ceph was hitting this race where we would remove an inode from the per-root
orphan list before we would release the space we had reserved for the inode.
We actually don't need a list or anything, we just need to make sure the
root doesn't try to free up the orphan reserve until after the inodes have
released their reservations. So use an atomic counter instead of a list on
the root and only decrement the counter after we've released our
reservation. I've tested this as well as several others and we no longer
see the warnings that you would see while running ceph. Thanks,
Btrfs: fix how we deal with the orphan block rsv
Ceph was hitting this race where we would remove an inode from the per-root
orphan list before we would release the space we had reserved for the inode.
We actually don't need a list or anything, we just need to make sure the
root doesn't try to free up the orphan reserve until after the inodes have
released their reservations. So use an atomic counter instead of a list on
the root and only decrement the counter after we've released our
reservation. I've tested this as well as several others and we no longer
see the warnings that you would see while running ceph. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Miao pointed this out while I was working on an orphan problem that messing
with a bitfield where different ranges are protected by different locks
doesn't work out right. Turns out we've been doing this forever where we
have different parts of the bit field protected by either no lock at all or
different locks which could cause all sorts of weird problems including the
issue I was hitting. So instead make a runtime_flags thing that we use the
normal bit operations on that are all atomic so we can keep having our
no/different locking for the different flags and then make force_compress
it's own thing so it can be treated normally. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We noticed that the ordered extent completion doesn't really rely on having
a page and that it could be done independantly of ending the writeback on a
page. This patch makes us not do the threaded endio stuff for normal
buffered writes and direct writes so we can end page writeback as soon as
possible (in irq context) and only start threads to do the ordered work when
it is actually done. Compression needs to be reworked some to take
advantage of this as well, but atm it has to do a find_get_page in its endio
handler so it must be done in its own thread. This makes direct writes
quite a bit faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed
parameter for_cow = 1. In fact, these two functions should never mark
their tree modification operations as for_cow, because they can change
the number of blocks referenced by a tree.
Hence, we remove the extra for_cow parameter from these functions and
make them pass a zero down.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Pull trivial updates from Jiri Kosina:
"As usual, it's mostly typo fixes, redundant code elimination and some
documentation updates."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits)
edac, mips: don't change code that has been removed in edac/mips tree
xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer
lib: Change mail address of Oskar Schirmer
net: Change mail address of Oskar Schirmer
arm/m68k: Change mail address of Sebastian Hess
i2c: Change mail address of Oskar Schirmer
net: Fix tcp_build_and_update_options comment in struct tcp_sock
atomic64_32.h: fix parameter naming mismatch
Kconfig: replace "--- help ---" with "---help---"
c2port: fix bogus Kconfig "default no"
edac: Fix spelling errors.
qla1280: Remove redundant NULL check before release_firmware() call
remoteproc: remove redundant NULL check before release_firmware()
qla2xxx: Remove redundant NULL check before release_firmware() call.
aic94xx: Get rid of redundant NULL check before release_firmware() call
tehuti: delete redundant NULL check before release_firmware()
qlogic: get rid of a redundant test for NULL before call to release_firmware()
bna: remove redundant NULL test before release_firmware()
tg3: remove redundant NULL test before release_firmware() call
typhoon: get rid of redundant conditional before all to release_firmware()
...
verify_parent_transid needs to lock the extent range to make
sure no IO is underway, and so it can safely clear the
uptodate bits if our checks fail.
But, a few callers are using it with spinlocks held. Most
of the time, the generation numbers are going to match, and
we don't want to switch to a blocking lock just for the error
case. This adds an atomic flag to verify_parent_transid,
and changes it to return EAGAIN if it needs to block to
properly verifiy things.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A user reported a panic where we were trying to fix a bad mirror but the
mirror number we were giving was 0, which is invalid. This is because we
don't do the transid verification until after the read, so as far as the
read code is concerned the read was a success. So instead store the mirror
we read from so that if there is some failure post read we know which mirror
to try next and which mirror needs to be fixed if we find a good copy of the
block. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Our code is not ready to cope with a sectorsize that's not equal to PAGE_SIZE.
It will lead to hanging-on while writing something.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Merge with latest Linus' tree, as I have incoming patches
that fix code that is newer than current HEAD of for-next.
Conflicts:
drivers/net/ethernet/realtek/r8169.c
Dave Sterba had put in patches to look for mixed data/metadata groups
with metadata bigger than 4KB. But these ended up in the wrong place
and it wasn't testing the feature flag correctly.
This updates the tests to make sure our sizes are matching
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs puts the filesystem metadata into its own address space, and
somehow the block device address space isn't getting onto disk properly
before a mount. The end result is that a loop of mkfs and mounting the
filesystem will sometimes find stale or incorrect data.
This commit should fix it by sprinkling fdatawrites and invalidate_bdev
calls around. This is a short term measure to make sure it is fixed.
The block devices really should be flushed and cleaned up higher in the
stack.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With support for bigger metadata blocks, we must avoid mounting a
filesystem with different block size for mixed block groups, this causes
corruption (found by xfstests/083).
Signed-off-by: David Sterba <dsterba@suse.cz>
Since we need to read and write extent buffers in their entirety we can't use
the normal bio_readpage_error stuff since it only works on a per page basis. So
instead make it so that if we see an io error in endio we just mark the eb as
having an IO error and then in btree_read_extent_buffer_pages we will manually
try other mirrors and then overwrite the bad mirror if we find a good copy.
This works with larger than page size blocks. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The metadata write IO completion code is now simple enough that we
don't need the threaded helpers anymore.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch simplifies how we track our extent buffers. Previously we could exit
writepages with only having written half of an extent buffer, which meant we had
to track the state of the pages and the state of the extent buffers differently.
Now we only read in entire extent buffers and write out entire extent buffers,
this allows us to simply set bits in our bflags to indicate the state of the eb
and we no longer have to do things like track uptodate with our iotree. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because btrfs cow's we can end up with extent buffers that are no longer
necessary just sitting around in memory. So instead of evicting these pages, we
could end up evicting things we actually care about. Thus we have
free_extent_buffer_stale for use when we are freeing tree blocks. This will
make it so that the ref for the eb being in the radix tree is dropped as soon as
possible and then is freed when the refcount hits 0 instead of waiting to be
released by releasepage. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We spend a lot of time looking up extent buffers from pages when we could just
store the pointer to the eb the page is associated with in page->private. This
patch does just that, and it makes things a little simpler and reduces a bit of
CPU overhead involved with doing metadata IO. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
A few years ago the btrfs code to support blocks lager than
the page size was disabled to fix a few corner cases in the
page cache handling. This fixes the code to properly support
large metadata blocks again.
Since current kernels will crash early and often with larger
metadata blocks, this adds an incompat bit so that older kernels
can't mount it.
This also does away with different blocksizes for nodes and leaves.
You get a single block size for all tree blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a filesystem got aborted due do error, transaction_kthread() will
busyloop. Fix it by going to sleep in that case as well. Maybe we should
just stop transaction_kthread() when filesystem is aborted but that would be
more complex.
Signed-off-by: Jan Kara <jack@suse.cz>
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
progress but aims to handle most errors other than internal logic
errors and ENOMEM more gracefully.
This iteration prevents most crashes but can run into lockups with
the page lock on occasion when the timing "works out."
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
lock_extent and unlock_extent are always called with GFP_NOFS, drop the
argument and use GFP_NOFS consistently.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
This pushes failures from the submit_bio_hook callbacks,
btrfs_submit_bio_hook and btree_submit_bio_hook into the callers, including
callers of submit_one_bio where it catches the failures with BUG_ON.
It also pushes up through the ->readpage_io_failed_hook to
end_bio_extent_writepage where the error is already caught with BUG_ON.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
find_and_setup_root BUGs when it encounters an error from
btrfs_find_last_root, which can occur if a path can't be allocated.
This patch pushes it up to its callers where it is already handled.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
The only error condition in clean_tree_block is an accounting bug.
Returning without modifying dirty_metadata_bytes and as if the cleaning
as been performed may cause problems later so it should panic instead.
It should probably be a BUG_ON but we have btrfs_panic now.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Quoth Chris:
"This is later than I wanted because I got backed up running through
btrfs bugs from the Oracle QA teams. But they are all bug fixes that
we've queued and tested since rc1.
Nothing in particular stands out, this just reflects bug fixing and QA
done in parallel by all the btrfs developers. The most user visible
of these is:
Btrfs: clear the extent uptodate bits during parent transid failures
Because that helps deal with out of date drives (say an iscsi disk
that has gone away and come back). The old code wasn't always
properly retrying the other mirror for this type of failure."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
Btrfs: fix compiler warnings on 32 bit systems
Btrfs: increase the global block reserve estimates
Btrfs: clear the extent uptodate bits during parent transid failures
Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
Btrfs: make sure we update latest_bdev
Btrfs: improve error handling for btrfs_insert_dir_item callers
Btrfs: be less strict on finding next node in clear_extent_bit
Btrfs: fix a bug on overcommit stuff
Btrfs: kick out redundant stuff in convert_extent_bit
Btrfs: skip states when they does not contain bits to clear
Btrfs: check return value of lookup_extent_mapping() correctly
Btrfs: fix deadlock on page lock when doing auto-defragment
Btrfs: fix return value check of extent_io_ops
btrfs: honor umask when creating subvol root
btrfs: silence warning in raid array setup
btrfs: fix structs where bitfields and spinlock/atomic share 8B word
btrfs: delalloc for page dirtied out-of-band in fixup worker
Btrfs: fix memory leak in load_free_space_cache()
btrfs: don't check DUP chunks twice
Btrfs: fix trim 0 bytes after a device delete
...
When we are setting up the mount, we close all the
devices that were not actually part of the metadata we found.
But, we don't make sure that one of those devices wasn't
fs_devices->latest_bdev, which means we can do a use after free
on the one we closed.
This updates latest_bdev as it goes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Gracefully fail when trying to mount a BTRFS file system that has a
sectorsize smaller than PAGE_SIZE.
On PPC it is possible to build a FS while using a 4k PAGE_SIZE kernel
then boot into a 64K PAGE_SIZE kernel. Presently open_ctree fails in an
endless loop and hangs the machine in this situation.
My debugging has show this Sector size < Page size to be a non trivial
situation and a graceful exit from the situation would be nice for the
time being.
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix reservations in btrfs_page_mkwrite
Btrfs: advance window_start if we're using a bitmap
btrfs: mask out gfp flags in releasepage
Btrfs: fix enospc error caused by wrong checks of the chunk
Btrfs: do not defrag a file partially
Btrfs: fix warning for 32-bit build of fs/btrfs/check-integrity.c
Btrfs: use cluster->window_start when allocating from a cluster bitmap
Btrfs: Check for NULL page in extent_range_uptodate
btrfs: Fix busyloops in transaction waiting code
Btrfs: make sure a bitmap has enough bytes
Btrfs: fix uninit warning in backref.c
btree_releasepage is a callback and can be passed unknown gfp flags and then
they may end up in kmem_cache_alloc called from alloc_extent_state, slab
allocator will BUG_ON when there is HIGHMEM or DMA32 flag set.
This may happen when btrfs is mounted from a loop device, which masks out
__GFP_IO flag. The check in try_release_extent_state
3399 if ((mask & GFP_NOFS) == GFP_NOFS)
3400 mask = GFP_NOFS;
will not work and passes unfiltered flags further resulting in crash at
mm/slab.c:2963
[<000000000024ae4c>] cache_alloc_refill+0x3b4/0x5c8
[<000000000024c810>] kmem_cache_alloc+0x204/0x294
[<00000000001fd3c2>] mempool_alloc+0x52/0x170
[<000003c000ced0b0>] alloc_extent_state+0x40/0xd4 [btrfs]
[<000003c000cee5ae>] __clear_extent_bit+0x38a/0x4cc [btrfs]
[<000003c000cee78c>] try_release_extent_state+0x9c/0xd4 [btrfs]
[<000003c000cc4c66>] btree_releasepage+0x7e/0xd0 [btrfs]
[<0000000000210d84>] shrink_page_list+0x6a0/0x724
[<0000000000211394>] shrink_inactive_list+0x230/0x578
[<0000000000211bb8>] shrink_list+0x6c/0x120
[<0000000000211e4e>] shrink_zone+0x1e2/0x228
[<0000000000211f24>] shrink_zones+0x90/0x254
[<0000000000213410>] do_try_to_free_pages+0xac/0x420
[<0000000000213ae0>] try_to_free_pages+0x13c/0x1b0
[<0000000000204e6c>] __alloc_pages_nodemask+0x5b4/0x9a8
[<00000000001fb04a>] grab_cache_page_write_begin+0x7e/0xe8
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
btrfs: take allocation of ->tree_root into open_ctree()
btrfs: let ->s_fs_info point to fs_info, not root...
btrfs: consolidate failure exits in btrfs_mount() a bit
btrfs: make free_fs_info() call ->kill_sb() unconditional
btrfs: merge free_fs_info() calls on fill_super failures
btrfs: kill pointless reassignment of ->s_fs_info in btrfs_fill_super()
btrfs: make open_ctree() return int
btrfs: sanitizing ->fs_info, part 5
btrfs: sanitizing ->fs_info, part 4
btrfs: sanitizing ->fs_info, part 3
btrfs: sanitizing ->fs_info, part 2
btrfs: sanitizing ->fs_info, part 1
btrfs: fix a deadlock in btrfs_scan_one_device()
btrfs: fix mount/umount race
btrfs: get ->kill_sb() of its own
btrfs: preparation to fixing mount/umount race
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
Btrfs: use larger system chunks
Btrfs: add a delalloc mutex to inodes for delalloc reservations
Btrfs: space leak tracepoints
Btrfs: protect orphan block rsv with spin_lock
Btrfs: add allocator tracepoints
Btrfs: don't call btrfs_throttle in file write
Btrfs: release space on error in page_mkwrite
Btrfs: fix btrfsck error 400 when truncating a compressed
Btrfs: do not use btrfs_end_transaction_throttle everywhere
Btrfs: add balance progress reporting
Btrfs: allow for resuming restriper after it was paused
Btrfs: allow for canceling restriper
Btrfs: allow for pausing restriper
Btrfs: add skip_balance mount option
Btrfs: recover balance on mount
Btrfs: save balance parameters to disk
Btrfs: soft profile changing mode (aka soft convert)
Btrfs: implement online profile changing
Btrfs: do not reduce profile in do_chunk_alloc()
Btrfs: virtual address space subset filter
...
Fix up trivial conflict in fs/btrfs/ioctl.c due to the use of the new
mnt_drop_write_file() helper.
Implement an ioctl for canceling restriper. Currently we wait until
relocation of the current block group is finished, in future this can be
done by triggering a commit. Balance item is deleted and no memory
about the interrupted balance is kept.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement an ioctl for pausing restriper. This pauses the relocation,
but balance is still considered to be "in progress": balance item is
not deleted, other volume operations cannot be started, etc. If paused
in the middle of profile changing operation we will continue making
allocations with the target profile.
Add a hook to close_ctree() to pause restriper and free its data
structures on unmount. (It's safe to unmount when restriper is in
"paused" state, we will resume with the same parameters on the next
mount)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
On mount, if balance item is found, resume balance in a separate
kernel thread.
Try to be smart to continue roughly where previous balance (or convert)
was interrupted. For chunk types that were being converted to some
profile we turn on soft convert, in case of a simple balance we turn on
usage filter and relocate only less-than-90%-full chunks of that type.
These are just heuristics but they help quite a bit, and can be improved
in future.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add basic restriper infrastructure: extended balancing ioctl and all
related ioctl data structures, add data structure for tracking
restriper's state to fs_info, etc. The semantics of the old balancing
ioctl are fully preserved.
Explicitly disallow any volume operations when balance is in progress.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.
This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.
[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;
if (PageDirty(page) && !sync &&
mapping->a_ops->migratepage != migrate_page)
rc = -EBUSY;
This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex,
but when we mount a filesystem which has backing seed devices, we have
this lock chain:
open_ctree()
lock(chunk_mutex);
read_chunk_tree();
read_one_dev();
open_seed_devices();
lock(uuid_mutex);
and then we hit a lockdep splat.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
the latter can be obtained from the former (by looking as ->tree_root)
just as cheaply as we currently are doing the other way round.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It returns either ERR_PTR(-ve) or sb->s_fs_info. The latter can
be found by caller just as well, TYVM, no need to return it. Just
return -ve or 0...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
close_ctree() uses a weird mix of accesses to root->fs_info and
its value at the beginning of function stored in local variable.
Since ->fs_info *never* changes, let's just use the local variable
to avoid confusion.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A new helper: btrfs_alloc_root(fs_info); allocates btrfs_root
and sets ->fs_info. All places allocating the suckers converted
to it. At that point we *never* reassign ->fs_info of btrfs_root;
it's set before anyone sees the address of newly allocated
struct btrfs_root and never assigned anywhere else.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
move assignments to ->fs_info in open_ctree() up, to the place
just after the original allocations. Assignment for tree_root
becomes a no-op - we'd obtained fs_info from tree_root->fs_info
in the first place.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We need fs_info and root to live until the moment when the victim
superblock leaves the list, so we need to postpone free_fs_info()
until after ->put_super(). The call is buried in close_ctree(),
though, so we need to lift it into the callers (including
btrfs_put_super()) first.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
from every call site. The for_cow parameter will later on be used to
determine if a ref will change anything with respect to qgroups.
Delayed refs coming from relocation are always counted as for_cow, as they
don't change subvol quota.
Also pass in the fs_info for later use.
btrfs_find_all_roots() will use this as an optimization, as changes that are
for_cow will not change anything with respect to which root points to a
certain leaf. Thus, we don't need to add the current sequence number to
those delayed refs.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
* master: (848 commits)
SELinux: Fix RCU deref check warning in sel_netport_insert()
binary_sysctl(): fix memory leak
mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
ipmi_watchdog: restore settings when BMC reset
oom: fix integer overflow of points in oom_badness
memcg: keep root group unchanged if creation fails
nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
nilfs2: unbreak compat ioctl
cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
evm: prevent racing during tfm allocation
evm: key must be set once during initialization
mmc: vub300: fix type of firmware_rom_wait_states module parameter
Revert "mmc: enable runtime PM by default"
mmc: sdhci: remove "state" argument from sdhci_suspend_host
x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
IB/qib: Correct sense on freectxts increment and decrement
RDMA/cma: Verify private data length
cgroups: fix a css_set not found bug in cgroup_attach_proc
oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
...
Conflicts:
kernel/cgroup_freezer.c
This is the last part of the patch series. It modifies the btrfs
code to use the integrity check module if configured to do so
with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
the only effective change is that code is added that handles the
mount option to activate the integrity check. If the mount option is
set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
complains in the log and the mount fails with EINVAL.
Add the mount option to activate the usage of the integrity check
code.
Add invocation of btrfs integrity check code init and cleanup
function on mount and umount, respectively.
Add hook to call btrfs integrity check code version of
submit_bh/submit_bio.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: unplug every once and a while
Btrfs: deal with NULL srv_rsv in the delalloc inode reservation code
Btrfs: only set cache_generation if we setup the block group
Btrfs: don't panic if orphan item already exists
Btrfs: fix leaked space in truncate
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Btrfs: deal with enospc from dirtying inodes properly
Btrfs: fix num_workers_starting bug and other bugs in async thread
BTRFS: Establish i_ops before calling d_instantiate
Btrfs: add a cond_resched() into the worker loop
Btrfs: fix ctime update of on-disk inode
btrfs: keep orphans for subvolume deletion
Btrfs: fix inaccurate available space on raid0 profile
Btrfs: fix wrong disk space information of the files
Btrfs: fix wrong i_size when truncating a file to a larger size
Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror
* 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: lower the dirty balance poll interval
Al pointed out we have some random problems with the way we account for
num_workers_starting in the async thread stuff. First of all we need to make
sure to decrement num_workers_starting if we fail to start the worker, so make
__btrfs_start_workers do this. Also fix __btrfs_start_workers so that it
doesn't call btrfs_stop_workers(), there is no point in stopping everybody if we
failed to create a worker. Also check_pending_worker_creates needs to call
__btrfs_start_work in it's work function since it already increments
num_workers_starting.
People only start one worker at a time, so get rid of the num_workers argument
everywhere, and make btrfs_queue_worker a void since it will always succeed.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: (24 commits)
freezer: fix wait_event_freezable/__thaw_task races
freezer: kill unused set_freezable_with_signal()
dmatest: don't use set_freezable_with_signal()
usb_storage: don't use set_freezable_with_signal()
freezer: remove unused @sig_only from freeze_task()
freezer: use lock_task_sighand() in fake_signal_wake_up()
freezer: restructure __refrigerator()
freezer: fix set_freezable[_with_signal]() race
freezer: remove should_send_signal() and update frozen()
freezer: remove now unused TIF_FREEZE
freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE
cgroup_freezer: prepare for removal of TIF_FREEZE
freezer: clean up freeze_processes() failure path
freezer: kill PF_FREEZING
freezer: test freezable conditions while holding freezer_lock
freezer: make freezing indicate freeze condition in effect
freezer: use dedicated lock instead of task_lock() + memory barrier
freezer: don't distinguish nosig tasks on thaw
freezer: remove racy clear_freeze_flag() and set PF_NOFREEZE on dead tasks
freezer: rename thaw_process() to __thaw_task() and simplify the implementation
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: remove free-space-cache.c WARN during log replay
Btrfs: sectorsize align offsets in fiemap
Btrfs: clear pages dirty for io and set them extent mapped
Btrfs: wait on caching if we're loading the free space cache
Btrfs: prefix resize related printks with btrfs:
btrfs: fix stat blocks accounting
Btrfs: avoid unnecessary bitmap search for cluster setup
Btrfs: fix to search one more bitmap for cluster setup
btrfs: mirror_num should be int, not u64
btrfs: Fix up 32/64-bit compatibility for new ioctls
Btrfs: fix barrier flushes
Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
There is no reason to export two functions for entering the
refrigerator. Calling refrigerator() instead of try_to_freeze()
doesn't save anything noticeable or removes any race condition.
* Rename refrigerator() to __refrigerator() and make it return bool
indicating whether it scheduled out for freezing.
* Update try_to_freeze() to return bool and relay the return value of
__refrigerator() if freezing().
* Convert all refrigerator() users to try_to_freeze().
* Update documentation accordingly.
* While at it, add might_sleep() to try_to_freeze().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@infradead.org>
My previous patch introduced some u64 for failed_mirror variables, this one
makes it consistent again.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs is writing the super blocks, it send barrier flushes to make
sure writeback caching drives get all the metadata on disk in the
right order.
But, we have two bugs in the way these are sent down. When doing
full commits (not via the tree log), we are sending the barrier down
before the last super when it should be going down before the first.
In multi-device setups, we should be waiting for the barriers to
complete on all devices before writing any of the supers.
Both of these bugs can cause corruptions on power failures. We fix it
with some new code to send down empty barriers to all devices before
writing the first super.
Alexandre Oliva found the multi-device bug. Arne Jansen did the async
barrier loop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: rename the option to nospace_cache
Btrfs: handle bio_add_page failure gracefully in scrub
Btrfs: fix deadlock caused by the race between relocation
Btrfs: only map pages if we know we need them when reading the space cache
Btrfs: fix orphan backref nodes
Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush}
Btrfs: fix unreleased path in btrfs_orphan_cleanup()
Btrfs: fix no reserved space for writing out inode cache
Btrfs: fix nocow when deleting the item
Btrfs: tweak the delayed inode reservations again
Btrfs: rework error handling in btrfs_mount()
Btrfs: close devices on all error paths in open_ctree()
Btrfs: avoid null dereference and leaks when bailing from open_ctree()
Btrfs: fix subvol_name leak on error in btrfs_mount()
Btrfs: fix memory leak in btrfs_parse_early_options()
Btrfs: fix our reservations for updating an inode when completing io
Btrfs: fix oops on NULL trans handle in btrfs_truncate
btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
Fix a bug introduced by 7e662854 where we would leave devices busy on
certain error paths in open_ctree(). fs_info is guaranteed to be
non-NULL now so it's safe to dereference it on all error paths.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix bugs introduced by 6c41761f. Firstly, after failing to allocate any
of the tree roots (first 'goto fail' in open_ctree()) we would
dereference a NULL fs_info pointer in free_fs_info(). Secondly, after
failures from init_srcu_struct(), setup_bdi() and new_inode() we would
leak all earlier allocated roots: fs_info fields haven't been
initialized yet so free_fs_info() is rendered useless.
Fix this by initializing fs_info pointer and fs_info fields before any
allocations happen.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
Btrfs: check for a null fs root when writing to the backup root log
Btrfs: fix race during transaction joins
Btrfs: fix a potential btrfs_bio leak on scrub fixups
Btrfs: rename btrfs_bio multi -> bbio for consistency
Btrfs: stop leaking btrfs_bios on readahead
Btrfs: stop the readahead threads on failed mount
Btrfs: fix extent_buffer leak in the metadata IO error handling
Btrfs: fix the new inspection ioctls for 32 bit compat
Btrfs: fix delayed insertion reservation
Btrfs: ClearPageError during writepage and clean_tree_block
Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
Btrfs: make a delayed_block_rsv for the delayed item insertion
Btrfs: add a log of past tree roots
btrfs: separate superblock items out of fs_info
Btrfs: use the global reserve when truncating the free space cache inode
Btrfs: release metadata from global reserve if we have to fallback for unlink
Btrfs: make sure to flush queued bios if write_cache_pages waits
Btrfs: fix extent pinning bugs in the tree log
Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
Btrfs: don't wait as long for more batches during SSD log commit
...
During log replay, can commit the transaction before the fs_root
pointers are setup, so we have to make sure they are not null before
trying to use them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The scrub readahead branch brought in a new error handling hook,
but it was leaking extent_buffer references.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I've been hitting warnings in use_block_rsv when running the delayed insertion
stuff. It's because we will readjust global block rsv based on what is in use,
which means we could end up discarding reservations that are for the delayed
insertion stuff. So instead create a seperate block rsv for the delayed
insertion stuff. This will also make it easier to debug problems with the
delayed insertion reservations since we will know that only the delayed
insertion code touches this block_rsv. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This takes some of the free space in the btrfs super block
to record information about most of the roots in the last four
commits.
It also adds a -o recovery to use the root history log when
we're not able to read the tree of tree roots, the extent
tree root, the device tree root or the csum root.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)
Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.
Signed-off-by: David Sterba <dsterba@suse.cz>
write_cache_pages tries to build up a large bio to stuff down the pipe.
But if it needs to wait for a page lock, it needs to make sure and send
down any pending writes so we don't deadlock with anyone who has the
page lock and is waiting for writeback of things inside the bio.
Dave Sterba triggered this as a deadlock between the autodefrag code and
the extent write_cache_pages
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
One of the things that kills us is the fact that our ENOSPC reservations are
horribly over the top in most normal cases. There isn't too much that can be
done about this because when we are completely full we really need them to work
like this so we don't under reserve. However if there is plenty of unallocated
chunks on the disk we can use that to gauge how much we can overcommit. So this
patch adds chunk free space accounting so we always know how much unallocated
space we have. Then if we fail to make a reservation within our allocated
space, check to see if we can overcommit. In the normal flushing case (like
with delalloc metadata reservations) we'll take the free space and divide it by
2 if our metadata profile is setup for DUP or any of those, and then divide it
by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing
case (we really need this reservation now!) we only limit ourselves to half of
the free space. This makes this fio test
[torrent]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync
directory=/mnt/btrfs-test
go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
file system. This doesn't seem to break my other enospc tests, but could really
use some more testing as this is a super scary change. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In moving some enospc stuff around I noticed that when we unmount we are often
evicting the free space cache inodes before we do our last commit. This isn't
bad, but it makes us constantly have to re-read the inodes back. So instead
don't evict the cache until after we do our last commit, this will make things a
little less crappy and makes a future enospc change work properly. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This adds the hooks needed for readahead. In the readpage_end_io_hook,
the extent state is checked for the EXTENT_READAHEAD flag. Only in this
case the readahead hook is called, to keep the impact on non-ra as low
as possible.
Additionally, a hook for a failed IO is added, otherwise readahead would
wait indefinitely for the extent to finish.
Changes for v2:
- eliminate race condition
Signed-off-by: Arne Jansen <sensille@gmx.net>
Add state information for readahead to btrfs_fs_info and btrfs_device
Changes v2:
- don't wait in radix_trees
- add own set of workers for readahead
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Add a READAHEAD extent buffer flag.
Add a function to trigger a read with this flag set.
Changes v2:
- use extent buffer flags instead of extent state flags
Changes v5:
- adapt to changed read_extent_buffer_pages interface
- don't return eb from reada_tree_block_flagged if it has CORRUPT flag set
Signed-off-by: Arne Jansen <sensille@gmx.net>
read_extent_buffer_pages currently has two modes, either trigger a read
without waiting for anything, or wait for the I/O to finish. The former
also bails when it's unable to lock the page. This patch now adds an
additional parameter to allow it to block on page lock, but don't wait
for completion.
Changes v5:
- merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
WAIT_PAGE_LOCK
Change v6:
- fix bug introduced in v5
Signed-off-by: Arne Jansen <sensille@gmx.net>
Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
Btrfs: use the commit_root for reading free_space_inode crcs
Btrfs: reduce extent_state lock contention for metadata
Btrfs: remove lockdep magic from btrfs_next_leaf
Btrfs: make a lockdep class for each root
Btrfs: switch the btrfs tree locks to reader/writer
Btrfs: fix deadlock when throttling transactions
Btrfs: stop using highmem for extent_buffers
Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
Btrfs: tag pages for writeback in sync
Btrfs: fix enospc problems with delalloc
Btrfs: don't flush delalloc arbitrarily
Btrfs: use find_or_create_page instead of grab_cache_page
Btrfs: use a worker thread to do caching
Btrfs: fix how we merge extent states and deal with cached states
Btrfs: use the normal checksumming infrastructure for free space cache
Btrfs: serialize flushers in reserve_metadata_bytes
Btrfs: do transaction space reservation before joining the transaction
Btrfs: try to only do one btrfs_search_slot in do_setxattr
This patch was originally from Tejun Heo. lockdep complains about the btrfs
locking because we sometimes take btree locks from two different trees at the
same time. The current classes are based only on level in the btree, which
isn't enough information for lockdep to figure out if the lock is safe.
This patch makes a class for each type of tree, and lumps all the FS trees that
actually have files and directories into the same class.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.
The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.
This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A user reported a deadlock when copying a bunch of files. This is because they
were low on memory and kthreadd got hung up trying to migrate pages for an
allocation when starting the caching kthread. The page was locked by the person
starting the caching kthread. To fix this we just need to use the async thread
stuff so that the threads are already created and we don't have to worry about
deadlocks. Thanks,
Reported-by: Roman Mamedov <rm@romanrm.ru>
Signed-off-by: Josef Bacik <josef@redhat.com>
When allocation fails in btrfs_read_fs_root_no_name, ret is not set
although it is returned, holding a garbage value.
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Removes code no longer used. The sysfs file itself is kept, because the
btrfs developers expressed interest in putting new entries to sysfs.
Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The recent commit to get rid of our trans_mutex introduced
some races with block group relocation. The problem is that relocation
needs to do some record keeping about each root, and it was relying
on the transaction mutex to coordinate things in subtle ways.
This fix adds a mutex just for the relocation code and makes sure
it doesn't have a big impact on normal operations. The race is
really fixed in btrfs_record_root_in_trans, which is where we
step back and wait for the relocation code to finish accounting
setup.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
list_splice_init will make delalloc_inodes empty, but without a spinlock
around, this may produce corrupted list head, accessed in many placess,
The race window is very tight and nobody seems to have hit it so far.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Scrub starts the workers each time a scrub starts and stops them after it
finished. This patch adds an initialization for the workers before each
start, otherwise the workers behave strangely.
Signed-off-by: Arne Jansen <sensille@gmx.net>
write_dev_supers was changed to use RCU to protect the list of
devices, but it was then sleeping while it actually wrote the supers.
This fixes it to just use the mutex, since we really don't any
concurrency in write_dev_supers anyway.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This will detect small random writes into files and
queue the up for an auto defrag process. It isn't well suited to
database workloads yet, but works for smaller files such as rpm, sqlite
or bdb databases.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
fs_devices->devices is only updated on remove and add device paths, so we can
use rcu to protect it in the reader side
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On btrfs_congested_fn and __unplug_io_fn paths, we should hold
device_list_mutex to avoid remove/add device path to
update fs_devices->devices
On __btrfs_close_devices and btrfs_prepare_sprout paths, the devices in
fs_devices->devices or fs_devices->devices is updated, so we should hold
the mutex to avoid the reader side to reach them
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
240f62c875 replaced the node_lock with rcu_read_lock, but forgot
to remove the actual lock in the data structure. Remove it here.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We use trans_mutex for lots of things, here's a basic list
1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists
Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction. So replace the
trans_mutex with a trans_lock spinlock and use it to do the following
1) Protect fs_info->running_transaction. All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list. This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list. This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work. So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.
I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :). So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.
This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Remove code which has been #if0-ed out for a very long time and does not
seem to be related to current codebase anymore.
Signed-off-by: David Sterba <dsterba@suse.cz>
Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.
text data bss dec hex filename
402081 7464 200 409745 64091 btrfs.ko.base
398620 7144 200 405964 631cc btrfs.ko.remove-all
Signed-off-by: David Sterba <dsterba@suse.cz>
all callers pass GFP_NOFS, but the GFP mask argument is not used in the
function; GFP_ATOMIC is passed to radix tree initialization and it's the
only correct one, since we're using the preload/insert mechanism of
radix tree.
Let's drop the gfp mask from btrfs function, this will not change
behaviour.
Signed-off-by: David Sterba <dsterba@suse.cz>
The superblock's ->s_fs_info is properly set in btrfs_fill_super, after
a call to open_ctree, which derefereces it before check. Although
tree_root is set via btrfs_set_super, let's be defensive and leave the
check in place.
Signed-off-by: David Sterba <dsterba@suse.cz>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: cleanup error handling in inode.c
Btrfs: put the right bio if we have an error
Btrfs: free bitmaps properly when evicting the cache
Btrfs: Free free_space item properly in btrfs_trim_block_group()
btrfs: add missing spin_unlock to a rare exit path
Btrfs: check return value of kmalloc()
btrfs: fix wrong allocating flag when reading page
Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
This is similar to block group caching.
We dedicate a special inode in fs tree to save free ino cache.
At the very first time we create/delete a file after mount, the free ino
cache will be loaded from disk into memory. When the fs tree is commited,
the cache will be written back to disk.
To keep compatibility, we check the root generation against the generation
of the special inode when loading the cache, so the loading will fail
if the btrfs filesystem was mounted in an older kernel before.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
Btrfs: fix free space cache leak
Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
Btrfs end_bio_extent_readpage should look for locked bits
Btrfs: don't force chunk allocation in find_free_extent
Btrfs: Check validity before setting an acl
Btrfs: Fix incorrect inode nlink in btrfs_link()
Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
Btrfs: make uncache_state unconditional
btrfs: using cached extent_state in set/unlock combinations
Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
Btrfs: fix subvolume mount by name problem when default mount subvolume is set
fix user annotation in ioctl.c
Btrfs: check for duplicate iov_base's when doing dio reads
btrfs: properly handle overlapping areas in memmove_extent_buffer
Btrfs: fix memory leaks in btrfs_new_inode()
Btrfs: check for duplicate iov_base's when doing dio reads
Btrfs: reuse the extent_map we found when calling btrfs_get_extent
Btrfs: do not use async submit for small DIO io's
Btrfs: don't split dio bios if we don't have to
...
I've been working on making our O_DIRECT latency not suck and I noticed we were
taking the trans_mutex in btrfs_end_transaction. So to do this we convert
num_writers and use_count to atomic_t's and just decrement them in
btrfs_end_transaction. Instead of deleting the transaction from the trans list
in put_transaction we do that in btrfs_commit_transaction() since that's the
only time it actually needs to be removed from the list. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: don't warn in btrfs_add_orphan
Btrfs: fix free space cache when there are pinned extents and clusters V2
Btrfs: Fix uninitialized root flags for subvolumes
btrfs: clear __GFP_FS flag in the space cache inode
Btrfs: fix memory leak in start_transaction()
Btrfs: fix memory leak in btrfs_ioctl_start_sync()
Btrfs: fix subvol_sem leak in btrfs_rename()
Btrfs: Fix oops for defrag with compression turned on
Btrfs: fix /proc/mounts info.
Btrfs: fix compiler warning in file.c
root_item->flags and root_item->byte_limit are not initialized when
a subvolume is created. This bug is not revealed until we added
readonly snapshot support - now you mount a btrfs filesystem and you
may find the subvolumes in it are readonly.
To work around this problem, we steal a bit from root_item->inode_item->flags,
and use it to indicate if those fields have been properly initialized.
When we read a tree root from disk, we check if the bit is set, and if
not we'll set the flag and initialize the two fields of the root item.
Reported-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Andreas Philipp <philipp.andreas@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits)
Btrfs: fix __btrfs_map_block on 32 bit machines
btrfs: fix possible deadlock by clearing __GFP_FS flag
btrfs: check link counter overflow in link(2)
btrfs: don't mess with i_nlink of unlocked inode in rename()
Btrfs: check return value of btrfs_alloc_path()
Btrfs: fix OOPS of empty filesystem after balance
Btrfs: fix memory leak of empty filesystem after balance
Btrfs: fix return value of setflags ioctl
Btrfs: fix uncheck memory allocations
btrfs: make inode ref log recovery faster
Btrfs: add btrfs_trim_fs() to handle FITRIM
Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP
Btrfs: make update_reserved_bytes() public
btrfs: return EXDEV when linking from different subvolumes
Btrfs: Per file/directory controls for COW and compression
Btrfs: add datacow flag in inode flag
btrfs: use GFP_NOFS instead of GFP_KERNEL
Btrfs: check return value of read_tree_block()
btrfs: properly access unaligned checksum buffer
...
Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in
the block layer.
Using the GFP_HIGHUSER_MOVABLE flag to allocate the metadata's page may cause
deadlock.
Task1
open()
...
btrfs_search_slot()
...
btrfs_cow_block()
...
alloc_page()
wait for reclaiming
shrink_slab()
...
shrink_icache_memory()
...
btrfs_evict_inode()
...
btrfs_search_slot()
If the path is locked by task1, the deadlock happens.
So the btree's page cache is different with the file's page cache, it can not
allocate pages by GFP_HIGHUSER_MOVABLE flag, we must clear __GFP_FS flag in
GFP_HIGHUSER_MOVABLE flag.
Reported-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs will remove unused block groups after balance.
When a empty filesystem is balanced, the block group with tag "DATA" may be
dropped, and after umount and mount again, it will not find "DATA" space_info
and lead to OOPS.
So we initial the necessary space_infos(DATA, SYSTEM, METADATA) to avoid OOPS.
Reported-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Callers of btrfs_discard_extent() should check if we are mounted with -o discard,
as we want to make fitrim to work even the fs is not mounted with -o discard.
Also we should use REQ_DISCARD to map the free extent to get a full mapping,
last we only return errors if
1. the error is not a EOPNOTSUPP
2. no device supports discard
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Data compression and data cow are controlled across the entire FS by mount
options right now. ioctls are needed to set this on a per file or per
directory basis. This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.
According to Chris's comment, there should be just one true compression
method(probably LZO) stored in the super. However, before this, we would
wait for that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in ram today.
After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control file and directory's datacow and compression attribute.
NOTE:
- The compression type is selected by such rules:
If we mount btrfs with compress options, ie, zlib/lzo, the type is it.
Otherwise, we'll use the default compress type (zlib today).
v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW
will be screwed by inheritance from parent directory.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On Fri, Mar 18, 2011 at 11:56:53AM -0400, Chris Mason wrote:
> Thanks for fielding this one. Does put_unaligned_le32 optimize away on
> platforms with efficient access? It would be great if we didn't need
> the #ifdef.
(quicktest: assembly output is same for put_unaligned_le32 and direct
assignment on my x86_64)
I was originally following examples in
Documentation/unaligned-memory-access.txt. From other code it seems to me that
the define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is intended for larger
portions of code. Macros/wrappers for {put,get}_unaligned* are chosen via
arch/<arch>/include/asm/unaligned.h accordingly, therefore it's safe to use
put_unaligned_le32 without the ifdef.
dave
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch changes some BUG_ON() to the error return.
(but, most callers still use BUG_ON())
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
Currently if we have corrupted items things will blow up in spectacular ways.
So as we read in blocks and they are leaves, check the entire leaf to make sure
all of the items are correct and point to valid parts in the leaf for the item
data the are responsible for. If the item is corrupt we will kick back EIO and
not read any of the copies since they are likely to not be correct either. This
will catch generic corruptions, it will be up to the individual callers of
btrfs_search_slot to make sure their items are right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we cannot truncate an inode for some reason we will never delete the orphan
item associated with that inode, which means that we will loop forever in
btrfs_orphan_cleanup. Instead of doing this just return error so we fail to
mount. It sucks, but hey it's better than hanging. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
There is a race where btrfs_releasepage can drop the
page->private contents just as alloc_extent_buffer is setting
up pages for metadata. Because of how the Btrfs page flags work,
this results in us skipping the crc on the page during IO.
This patch sovles the race by waiting until after the extent buffer
is inserted into the radix tree before it sets page private.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (33 commits)
Btrfs: Fix page count calculation
btrfs: Drop __exit attribute on btrfs_exit_compress
btrfs: cleanup error handling in btrfs_unlink_inode()
Btrfs: exclude super blocks when we read in block groups
Btrfs: make sure search_bitmap finds something in remove_from_bitmap
btrfs: fix return value check of btrfs_start_transaction()
btrfs: checking NULL or not in some functions
Btrfs: avoid uninit variable warnings in ordered-data.c
Btrfs: catch errors from btrfs_sync_log
Btrfs: make shrink_delalloc a little friendlier
Btrfs: handle no memory properly in prepare_pages
Btrfs: do error checking in btrfs_del_csums
Btrfs: use the global block reserve if we cannot reserve space
Btrfs: do not release more reserved bytes to the global_block_rsv than we need
Btrfs: fix check_path_shared so it returns the right value
btrfs: check return value of btrfs_start_ioctl_transaction() properly
btrfs: fix return value check of btrfs_join_transaction()
fs/btrfs/inode.c: Add missing IS_ERR test
btrfs: fix missing break in switch phrase
btrfs: fix several uncheck memory allocations
...
The error check of btrfs_join_transaction()/btrfs_join_transaction_nolock()
is added, and the mistake of the error check in several places is
corrected.
For more stable Btrfs, I think that we should reduce BUG_ON().
But, I think that long time is necessary for this.
So, I propose this patch as a short-term solution.
With this patch:
- To more stable Btrfs, the part that should be corrected is clarified.
- The panic isn't done by the NULL pointer reference etc. (even if
BUG_ON() is increased temporarily)
- The error code is returned in the place where the error can be easily
returned.
As a long-term plan:
- BUG_ON() is reduced by using the forced-readonly framework, etc.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
Btrfs: forced readonly mounts on errors
btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
btrfs: check NULL or not
btrfs: Don't pass NULL ptr to func that may deref it.
btrfs: mount failure return value fix
btrfs: Mem leak in btrfs_get_acl()
btrfs: fix wrong free space information of btrfs
btrfs: make the chunk allocator utilize the devices better
btrfs: restructure find_free_dev_extent()
btrfs: fix wrong calculation of stripe size
btrfs: try to reclaim some space when chunk allocation fails
btrfs: fix wrong data space statistics
fs/btrfs: Fix build of ctree
Btrfs: fix off by one while setting block groups readonly
Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
Btrfs: Add readonly snapshots support
Btrfs: Refactor btrfs_ioctl_snap_create()
btrfs: Extract duplicate decompress code
...
This patch comes from "Forced readonly mounts on errors" ideas.
As we know, this is the first step in being more fault tolerant of disk
corruptions instead of just using BUG() statements.
The major content:
- add a framework for generating errors that should result in filesystems
going readonly.
- keep FS state in disk super block.
- make sure that all of resource will be freed and released at umount time.
- make sure that fter FS is forced readonly on error, there will be no more
disk change before FS is corrected. For this, we should stop write operation.
After this patch is applied, the conversion from BUG() to such a framework can
happen incrementally.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_read_fs_root_no_radix(), 'root' is not freed if
btrfs_search_slot() returns error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>