2
0
mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-09-04 20:19:47 +08:00
Commit Graph

756 Commits

Author SHA1 Message Date
Jeff Mahoney
5d163e0e68 btrfs: unsplit printed strings
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Jeff Mahoney
cea67ab92d btrfs: clean the old superblocks before freeing the device
btrfs_rm_device frees the block device but then re-opens it using
the saved device name.  A race exists between the close and the
re-open that allows the block size to be changed.  The result
is getting stuck forever in the reclaim loop in __getblk_slow.

This patch moves the superblock cleanup before closing the block
device, which is also consistent with other callers.  We also don't
need a private copy of dev_name as the whole routine operates under
the uuid_mutex.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Josef Bacik
afcdd129e0 Btrfs: add a flags field to btrfs_fs_info
We have a lot of random ints in btrfs_fs_info that can be put into flags.  This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Naohiro Aota
5d8eb6fe51 btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs
Currently, btrfs_relocate_chunk() is removing relocated BG by itself. But
the work can be done by btrfs_delete_unused_bgs() (and it's better since it
trim the BG). Let's dedupe the code.

While btrfs_delete_unused_bgs() is already hitting the relocated BG, it
skip the BG since the BG has "ro" flag set (to keep balancing BG intact).
On the other hand, btrfs cannot drop "ro" flag here to prevent additional
writes. So this patch make use of "removed" flag.
btrfs_delete_unused_bgs() now detect the flag to distinguish whether a
read-only BG is relocating or not.

Signed-off-by: Naohiro Aota <naohiro.aota@hgst.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 17:59:49 +02:00
Linus Torvalds
28687b935e Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've queued up a few different fixes in here.  These range from
  enospc corners to fsync and quota fixes, and a few targeted at error
  handling for corrupt metadata/fuzzing"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix lockdep warning on deadlock against an inode's log mutex
  Btrfs: detect corruption when non-root leaf has zero item
  Btrfs: check btree node's nritems
  btrfs: don't create or leak aliased root while cleaning up orphans
  Btrfs: fix em leak in find_first_block_group
  btrfs: do not background blkdev_put()
  Btrfs: clarify do_chunk_alloc()'s return value
  btrfs: fix fsfreeze hang caused by delayed iputs deal
  btrfs: update btrfs_space_info's bytes_may_use timely
  btrfs: divide btrfs_update_reserved_bytes() into two functions
  btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
  btrfs: qgroup: Fix qgroup incorrectness caused by log replay
  btrfs: relocation: Fix leaking qgroups numbers on data extents
  btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
  btrfs: waiting on qgroup rescan should not always be interruptible
  btrfs: properly track when rescan worker is running
  btrfs: flush_space: treat return value of do_chunk_alloc properly
  Btrfs: add ASSERT for block group's memory leak
  btrfs: backref: Fix soft lockup in __merge_refs function
  Btrfs: fix memory leak of reloc_root
2016-08-26 20:22:01 -07:00
Anand Jain
1423881941 btrfs: do not background blkdev_put()
At the end of unmount/dev-delete, if the device exclusive open is not
actually closed, then there might be a race with another program in
the userland who is trying to open the device in exclusive mode and
it may fail for eg:
      unmount /btrfs; fsck /dev/x
      btrfs dev del /dev/x /btrfs; fsck /dev/x
so here background blkdev_put() is not a choice

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-08-25 03:58:28 -07:00
Jens Axboe
1eff9d322a block: rename bio bi_rw to bi_opf
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Linus Torvalds
d58b0d980f Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "This is part two of my btrfs pull, which is some cleanups and a batch
  of fixes.

  Most of the code here is from Jeff Mahoney, making the pointers we
  pass around internally more consistent and less confusing overall.  I
  noticed a small problem right before I sent this out yesterday, so I
  fixed it up and re-tested overnight"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits)
  Btrfs: fix __MAX_CSUM_ITEMS
  btrfs: btrfs_abort_transaction, drop root parameter
  btrfs: add btrfs_trans_handle->fs_info pointer
  btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
  btrfs: convert nodesize macros to static inlines
  btrfs: introduce BTRFS_MAX_ITEM_SIZE
  btrfs: cleanup, remove prototype for btrfs_find_root_ref
  btrfs: copy_to_sk drop unused root parameter
  btrfs: simpilify btrfs_subvol_inherit_props
  btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
  btrfs: tests, require fs_info for root
  btrfs: tests, move initialization into tests/
  btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
  btrfs: prefix fsid to all trace events
  btrfs: plumb fs_info into btrfs_work
  btrfs: remove obsolete part of comment in statfs
  btrfs: hide test-only member under ifdef
  btrfs: Ratelimit "no csum found" info message
  btrfs: Add ratelimit to btrfs printing
  Btrfs: fix unexpected balance crash due to BUG_ON
  ...
2016-08-04 19:56:16 -04:00
Linus Torvalds
d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Jeff Mahoney
66642832f0 btrfs: btrfs_abort_transaction, drop root parameter
__btrfs_abort_transaction doesn't use its root parameter except to
obtain an fs_info pointer.  We can obtain that from trans->root->fs_info
for now and from trans->fs_info in a later patch.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:26 +02:00
Jeff Mahoney
05f9a78012 btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
In btrfs_relocate_chunk, we get a transaction handle via
btrfs_start_trans_remove_block_group, which starts the transaction
using the extent root.  When we call btrfs_end_transaction, we're calling
it using the chunk root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:25 +02:00
Jeff Mahoney
14a1e067b4 btrfs: introduce BTRFS_MAX_ITEM_SIZE
We use BTRFS_LEAF_DATA_SIZE - sizeof(struct btrfs_item) in
several places.  This introduces a BTRFS_MAX_ITEM_SIZE macro to do the
same.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:54:24 +02:00
Jeff Mahoney
3cdde2240d btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
btrfs_test_opt and friends only use the root pointer to access
the fs_info.  Let's pass the fs_info directly in preparation to
eliminate similar patterns all over btrfs.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:53:16 +02:00
Liu Bo
5a488b9d2c Btrfs: fix unexpected balance crash due to BUG_ON
Mounting a btrfs can resume previous balance operations asynchronously.
An user got a crash when one drive has some corrupt sectors.

Since balance can cancel itself in case of any error, we can gracefully
return errors to upper layers and let balance do the cancel job.

Reported-by: sash <master.b.at.raven@chefmail.de>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Anand Jain
e2bf6e89b4 btrfs: make sure device is synced before return
An inconsistent behavior due to stale reads from the
disk was reported

  mail-archive.com/linux-btrfs@vger.kernel.org/msg54188.html

This patch will make sure devices are synced before
return in the unmount thread.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Anand Jain
f448341af9 btrfs: reorg btrfs_close_one_device()
Moves closer to the caller and removes declaration

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Vincent Stehlé
df5c82a8dc Btrfs: fix comparison in __btrfs_map_block()
Add missing comparison to op in expression, which was forgotten when doing
the REQ_OP transition.

Fixes: b3d3fa5199 ("btrfs: update __btrfs_map_block for REQ_OP transition")
Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-18 15:28:23 -06:00
Linus Torvalds
b971712afc Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I have a two part pull this time because one of the patches Dave
  Sterba collected needed to be against v4.7-rc2 or higher (we used
  rc4).  I try to make my for-linus-xx branch testable on top of the
  last major so we can hand fixes to people on the list more easily, so
  I've split this pull in two.

  This first part has some fixes and two performance improvements that
  we've been testing for some time.

  Josef's two performance fixes are most notable.  The transid tracking
  patch makes a big improvement on pretty much every workload"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: Force stripesize to the value of sectorsize
  btrfs: fix disk_i_size update bug when fallocate() fails
  Btrfs: fix error handling in map_private_extent_buffer
  Btrfs: fix error return code in btrfs_init_test_fs()
  Btrfs: don't do nocow check unless we have to
  btrfs: fix deadlock in delayed_ref_async_start
  Btrfs: track transid for delayed ref flushing
2016-06-25 08:42:31 -07:00
Chandan Rajendra
b7f67055d2 Btrfs: Force stripesize to the value of sectorsize
Btrfs code currently assumes stripesize to be same as
sectorsize. However Btrfs-progs (until commit
df05c7ed455f519e6e15e46196392e4757257305) has been setting
btrfs_super_block->stripesize to a value of 4096.

This commit makes sure that the value of btrfs_super_block->stripesize
is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
to sectorsize.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23 10:44:42 -07:00
Linus Torvalds
4c6459f945 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The most user visible change here is a fix for our recent superblock
  validation checks that were causing problems on non-4k pagesized
  systems"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize
  btrfs: remove build fixup for qgroup_account_snapshot
  btrfs: use new error message helper in qgroup_account_snapshot
  btrfs: avoid blocking open_ctree from cleaner_kthread
  Btrfs: don't BUG_ON() in btrfs_orphan_add
  btrfs: account for non-CoW'd blocks in btrfs_abort_transaction
  Btrfs: check if extent buffer is aligned to sectorsize
  btrfs: Use correct format specifier
2016-06-18 05:57:59 -10:00
Liu Bo
c871b0f2fd Btrfs: check if extent buffer is aligned to sectorsize
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer().  An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.

This adds a warning to let us quickly know what was happening.

Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17 18:32:40 +02:00
Linus Torvalds
3d0f0b6a55 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Has some fixes and some new self tests for btrfs.  The self tests are
  usually disabled in the .config file (unless you're doing btrfs dev
  work), and this bunch is meant to find problems with the 64K page size
  patches.

  Jeff has a patch to help people see if they are using the hardware
  assist crc32c module, which really helps us nail down problems when
  people ask why crcs are using so much CPU.

  Otherwise, it's small fixes"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system
  Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize
  Btrfs: self-tests: Use macros instead of constants and add missing newline
  Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes
  Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE
  btrfs: advertise which crc32c implementation is being used at module load
  Btrfs: add validadtion checks for chunk loading
  Btrfs: add more validation checks for superblock
  Btrfs: clear uptodate flags of pages in sys_array eb
  Btrfs: self-tests: Support non-4k page size
  Btrfs: Fix integer overflow when calculating bytes_per_bitmap
  Btrfs: test_check_exists: Fix infinite loop when searching for free space entries
  Btrfs: end transaction if we abort when creating uuid root
  btrfs: Use __u64 in exported linux/btrfs.h.
2016-06-10 14:13:27 -07:00
Chris Mason
719da39a61 Merge branch 'misc-fixes-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7 2016-06-08 14:36:12 -07:00
Mike Christie
81a75f6781 btrfs: use bio fields for op and flags
The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
b3d3fa5199 btrfs: update __btrfs_map_block for REQ_OP transition
We no longer pass in a bitmap of rq_flag_bits bits to __btrfs_map_block.
It will always be a REQ_OP, or the btrfs specific REQ_GET_READ_MIRRORS,
so this drops the bit tests.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
37226b2111 btrfs: use bio op accessors
This should be the easier cases to convert btrfs to
bio_set_op_attrs/bio_op.
They are mostly just cut and replace type of changes.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie
4e49ea4a3d block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Liu Bo
e06cd3dd7c Btrfs: add validadtion checks for chunk loading
To prevent fuzzed filesystem images from panic the whole system,
we need various validation checks to refuse to mount such an image
if btrfs finds any invalid value during loading chunks, including
both sys_array and regular chunks.

Note that these checks may not be sufficient to cover all corner cases,
feel free to add more checks.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 10:57:09 +02:00
Liu Bo
99e3ecfcb9 Btrfs: add more validation checks for superblock
This adds validation checks for super_total_bytes, super_bytes_used and
super_stripesize, super_num_devices.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 10:41:53 +02:00
Liu Bo
d865177a5e Btrfs: clear uptodate flags of pages in sys_array eb
We set uptodate flag to pages in the temporary sys_array eb,
but do not clear the flag after free eb.  As the special
btree inode may still hold a reference on those pages, the
uptodate flag can remain alive in them.

If btrfs_super_chunk_root has been intentionally changed to the
offset of this sys_array eb, reading chunk_root will read content
of sys_array and it will skip our beautiful checks in
btree_readpage_end_io_hook() because of
"pages of eb are uptodate => eb is uptodate"

This adds the 'clear uptodate' part to force it to read from disk.

Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06 10:14:40 +02:00
Linus Torvalds
b2d5ad8223 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The important part of this pull is Filipe's set of fixes for btrfs
  device replacement.  Filipe fixed a few issues seen on the list and a
  number he found on his own"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
  Btrfs: fix race between device replace and read repair
  Btrfs: fix race between device replace and discard
  Btrfs: fix race between device replace and chunk allocation
  Btrfs: fix race setting block group back to RW mode during device replace
  Btrfs: fix unprotected assignment of the left cursor for device replace
  Btrfs: fix race setting block group readonly during device replace
  Btrfs: fix race between device replace and block group removal
  Btrfs: fix race between readahead and device replace/removal
2016-06-04 11:56:28 -07:00
Josef Bacik
65d4f4c151 Btrfs: end transaction if we abort when creating uuid root
We still need to call btrfs_end_transaction if we call btrfs_abort_transaction,
otherwise we hang and make me super grumpy.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-01 00:32:42 +02:00
Filipe Manana
22ab04e814 Btrfs: fix race between device replace and chunk allocation
While iterating and copying extents from the source device, the device
replace code keeps adjusting a left cursor that is used to make sure that
once we finish processing a device extent, any future writes to extents
from the corresponding block group will get into both the source and
target devices. This left cursor is also used for resuming the device
replace operation at mount time.

However using this left cursor to decide whether writes go into both
devices or only the source device is not enough to guarantee we don't
miss copying extents into the target device. There are two cases where
the current approach fails. The first one is related to when there are
holes in the device and they get allocated for new block groups while
the device replace operation is iterating the device extents (more on
this explained below). The second one is that when that loop over the
device extents finishes, we start dellaloc, wait for all ordered extents
and then commit the current transaction, we might have got new block
groups allocated that are now using a device extent that has an offset
greater then or equals to the value of the left cursor, in which case
writes to extents belonging to these new block groups will get issued
only to the source device.

For the first case where the current approach of using a left cursor
fails, consider the source device currently has the following layout:

  [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
  3Gb             4Gb                         5Gb

While we are iterating the device extents from the source device using
the commit root of the device tree, the following happens:

        CPU 1                                            CPU 2

                      <we are at transaction N>

  scrub_enumerate_chunks()
    --> searches the device tree for
        extents belonging to the source
        device using the device tree's
        commit root
    --> 1st iteration finds extent belonging to
        block group A

        --> sets block group A to RO mode
            (btrfs_inc_block_group_ro)

        --> sets cursor left to found_key.offset
            which is 3Gb

        --> scrub_chunk() starts
            copies all allocated extents from
            block group's A stripe at source
            device into target device

                                                           btrfs_alloc_chunk()
                                                             --> allocates device extent
                                                                 in the range [4Gb, 5Gb[
                                                                 from the source device for
                                                                 a new block group C

                                                           extent allocated from block
                                                           group C for a direct IO,
                                                           buffered write or btree node/leaf

                                                           extent is written to, perhaps
                                                           in response to a writepages()
                                                           call from the VM or directly
                                                           through direct IO

                                                           the write is made only against
                                                           the source device and not against
                                                           the target device because the
                                                           extent's offset is in the interval
                                                           [4Gb, 5Gb[ which is larger then
                                                           the value of cursor_left (3Gb)

        --> scrub_chunks() finishes

        --> updates left cursor from 3Gb to
            4Gb

        --> btrfs_dec_block_group_ro() sets
            block group A back to RW mode

                             <we are still at transaction N>

    --> 2nd iteration finds extent belonging to
        block group B - it did not find the new
        extent in the range [4Gb, 5Gb[ for block
        group C because we are using the device
        tree's commit root or even because the
        block group's items are not all yet
        inserted in the respective btrees, that is,
        the block group is still attached to some
        transaction handle's new_bgs list and
        btrfs_create_pending_block_groups() was
        not called yet against that transaction
        handle, so the device extent items were
        not yet inserted into the devices tree

                             <we are still at transaction N>

        --> so we end not copying anything from the newly
            allocated device extent from the source device
            to the target device

So fix this by making __btrfs_map_block() always redirect writes to the
target device as well, independently of the left cursor's value. With
this change the left cursor is now used only for the purpose of tracking
progress and allow a mount operation to resume a device replace.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:26 +01:00
Filipe Manana
57ba4cb85b Btrfs: fix race between device replace and block group removal
When it's finishing, the device replace code iterates all extent maps
representing block group and for each one that has a stripe that refers
to the source device, it replaces its device with the target device.
However when it replaces the source device with the target device it,
the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID),
only after its ID is changed to match the one from the source device.
This leads to races with the chunk removal code that can temporarly see
a device with an ID of 0ULL and then attempt to use that ID to remove
items from the device tree and fail, causing a transaction abort:

[ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
[ 9238.594377] ------------[ cut here ]------------
[ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594403] BTRFS: Transaction aborted (error 1)
[ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se
rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl
oppy [last unloaded: btrfs]
[ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 9238.594421]  0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0
[ 9238.594422]  0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18
[ 9238.594423]  0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0
[ 9238.594424] Call Trace:
[ 9238.594428]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
[ 9238.594430]  [<ffffffff81052b14>] __warn+0xc2/0xdd
[ 9238.594432]  [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53
[ 9238.594434]  [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188
[ 9238.594450]  [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594452]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[ 9238.594464]  [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
[ 9238.594476]  [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs]
[ 9238.594489]  [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs]
[ 9238.594490]  [<ffffffff8106f403>] kthread+0xd4/0xdc
[ 9238.594494]  [<ffffffff8149e242>] ret_from_fork+0x22/0x40
[ 9238.594495]  [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
[ 9238.594496] ---[ end trace 183efbe50275f059 ]---

The sequence of steps leading to this is like the following:

              CPU 1                                           CPU 2

 btrfs_dev_replace_finishing()

   at this point
   dev_replace->tgtdev->devid ==
   BTRFS_DEV_REPLACE_DEVID (0ULL)

   ...

   btrfs_start_transaction()
   btrfs_commit_transaction()

                                                     btrfs_delete_unused_bgs()
                                                       btrfs_remove_chunk()

                                                         looks up for the extent map
                                                         corresponding to the chunk

                                                         lock_chunks() (chunk_mutex)
                                                         check_system_chunk()
                                                         unlock_chunks() (chunk_mutex)

   locks fs_info->chunk_mutex

   btrfs_dev_replace_update_device_in_mapping_tree()
     --> iterates fs_info->mapping_tree and
         replaces the device in every extent
         map's map->stripes[] with
         dev_replace->tgtdev, which still has
         an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)

                                                         iterates over all stripes from
                                                         the extent map

                                                           --> calls btrfs_free_dev_extent()
                                                               passing it the target device
                                                               that still has an ID of 0ULL

                                                           --> btrfs_free_dev_extent() fails
                                                             --> aborts current transaction

   finishes setting up the target device,
   namely it sets tgtdev->devid to the value
   of srcdev->devid (which is necessarily > 0)

   frees the srcdev

   unlocks fs_info->chunk_mutex

So fix this by taking the device list mutex while processing the stripes
for the chunk's extent map. This is similar to the race between device
replace and block group creation that was fixed by commit 50460e3718
("Btrfs: fix race when finishing dev replace leading to transaction abort").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:19 +01:00
Linus Torvalds
559b6d90a0 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs cleanups and fixes from Chris Mason:
 "We have another round of fixes and a few cleanups.

  I have a fix for short returns from btrfs_copy_from_user, which
  finally nails down a very hard to find regression we added in v4.6.

  Dave is pushing around gfp parameters, mostly to cleanup internal apis
  and make it a little more consistent.

  The rest are smaller fixes, and one speelling fixup patch"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits)
  Btrfs: fix handling of faults from btrfs_copy_from_user
  btrfs: fix string and comment grammatical issues and typos
  btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
  Btrfs: fix unexpected return value of fiemap
  Btrfs: free sys_array eb as soon as possible
  btrfs: sink gfp parameter to convert_extent_bit
  btrfs: make state preallocation more speculative in __set_extent_bit
  btrfs: untangle gotos a bit in convert_extent_bit
  btrfs: untangle gotos a bit in __clear_extent_bit
  btrfs: untangle gotos a bit in __set_extent_bit
  btrfs: sink gfp parameter to set_record_extent_bits
  btrfs: sink gfp parameter to set_extent_new
  btrfs: sink gfp parameter to set_extent_defrag
  btrfs: sink gfp parameter to set_extent_delalloc
  btrfs: sink gfp parameter to clear_extent_dirty
  btrfs: sink gfp parameter to clear_record_extent_bits
  btrfs: sink gfp parameter to clear_extent_bits
  btrfs: sink gfp parameter to set_extent_bits
  btrfs: make find_workspace warn if there are no workspaces
  btrfs: make find_workspace always succeed
  ...
2016-05-27 16:37:36 -07:00
David Sterba
42f31734eb Merge branch 'cleanups-4.7' into for-chris-4.7-20160525 2016-05-25 22:51:03 +02:00
Nicholas D Steeves
0132761017 btrfs: fix string and comment grammatical issues and typos
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25 22:35:14 +02:00
Liu Bo
1c8b5b6e8b Btrfs: free sys_array eb as soon as possible
While reading sys_chunk_array in superblock, btrfs creates a temporary
extent buffer.  Since we don't use it after finishing reading
 sys_chunk_array, we don't need to keep it in memory.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25 19:53:51 +02:00
Linus Torvalds
07be1337b9 Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has our merge window series of cleanups and fixes.  These target
  a wide range of issues, but do include some important fixes for
  qgroups, O_DIRECT, and fsync handling.  Jeff Mahoney moved around a
  few definitions to make them easier for userland to consume.

  Also whiteout support is included now that issues with overlayfs have
  been cleared up.

  I have one more fix pending for page faults during btrfs_copy_from_user,
  but I wanted to get this bulk out the door first"

* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (90 commits)
  btrfs: fix memory leak during RAID 5/6 device replacement
  Btrfs: add semaphore to synchronize direct IO writes with fsync
  Btrfs: fix race between block group relocation and nocow writes
  Btrfs: fix race between fsync and direct IO writes for prealloc extents
  Btrfs: fix number of transaction units for renames with whiteout
  Btrfs: pin logs earlier when doing a rename exchange operation
  Btrfs: unpin logs if rename exchange operation fails
  Btrfs: fix inode leak on failure to setup whiteout inode in rename
  btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT
  Btrfs: pin log earlier when renaming
  Btrfs: unpin log if rename operation fails
  Btrfs: don't do unnecessary delalloc flushes when relocating
  Btrfs: don't wait for unrelated IO to finish before relocation
  Btrfs: fix empty symlink after creating symlink and fsync parent dir
  Btrfs: fix for incorrect directory entries after fsync log replay
  btrfs: build fixup for qgroup_account_snapshot
  btrfs: qgroup: Fix qgroup accounting when creating snapshot
  Btrfs: fix fspath error deallocation
  btrfs: make find_workspace warn if there are no workspaces
  btrfs: make find_workspace always succeed
  ...
2016-05-21 10:49:22 -07:00
Andy Shevchenko
8da4b8c48e lib/uuid.c: move generate_random_uuid() to uuid.c
Let's gather the UUID related functions under one hood.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
David Sterba
36fac9e9ff Merge branch 'foreign/anand/dev-del-by-id-ext' into for-chris-4.7-20160516 2016-05-16 15:46:26 +02:00
David Sterba
5ef64a3e75 Merge branch 'cleanups-4.7' into for-chris-4.7-20160516 2016-05-16 15:46:24 +02:00
Anand Jain
48b3b9d401 btrfs: fix lock dep warning move scratch super outside of chunk_mutex
Move scratch super outside of the chunk lock to avoid below
lockdep warning. The better place to scratch super is in
the function btrfs_rm_dev_replace_free_srcdev() just before
free_device, which is outside of the chunk lock as well.

To reproduce:
  (fresh boot)
  mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdd /dev/sde
  mount /dev/sdc /btrfs
  dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
  (get devmgt from https://github.com/asj/devmgt.git)
  devmgt detach /dev/sde
  dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100
  sync
  btrfs replace start -Brf 3 /dev/sdf /btrfs <--
  devmgt attach host7

======================================================
[ INFO: possible circular locking dependency detected ]
4.6.0-rc2asj+ #1 Not tainted
---------------------------------------------------

btrfs/2174 is trying to acquire lock:
(sb_writers){.+.+.+}, at:
[<ffffffff812449b4>] __sb_start_write+0xb4/0xf0

but task is already holding lock:
(&fs_info->chunk_mutex){+.+.+.}, at:
[<ffffffffa05c5f55>] btrfs_dev_replace_finishing+0x145/0x980 [btrfs]

which lock already depends on the new lock.

Chain exists of:
sb_writers --> &fs_devs->device_list_mutex --> &fs_info->chunk_mutex
Possible unsafe locking scenario:
CPU0				CPU1
----				----
lock(&fs_info->chunk_mutex);
				lock(&fs_devs->device_list_mutex);
				lock(&fs_info->chunk_mutex);
lock(sb_writers);

*** DEADLOCK ***

-> #0 (sb_writers){.+.+.+}:
[<ffffffff810e6415>] __lock_acquire+0x1bc5/0x1ee0
[<ffffffff810e707e>] lock_acquire+0xbe/0x210
[<ffffffff810df49a>] percpu_down_read+0x4a/0xa0
[<ffffffff812449b4>] __sb_start_write+0xb4/0xf0
[<ffffffff81265534>] mnt_want_write+0x24/0x50
[<ffffffff812508a2>] path_openat+0x952/0x1190
[<ffffffff81252451>] do_filp_open+0x91/0x100
[<ffffffff8123f5cc>] file_open_name+0xfc/0x140
[<ffffffff8123f643>] filp_open+0x33/0x60
[<ffffffffa0572bb6>] update_dev_time+0x16/0x40 [btrfs]
[<ffffffffa057f60d>] btrfs_scratch_superblocks+0x5d/0xb0 [btrfs]
[<ffffffffa057f70e>] btrfs_rm_dev_replace_remove_srcdev+0xae/0xd0 [btrfs]
[<ffffffffa05c62c5>] btrfs_dev_replace_finishing+0x4b5/0x980 [btrfs]
[<ffffffffa05c6ae8>] btrfs_dev_replace_start+0x358/0x530 [btrfs]

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Josef Bacik
e042d1ec44 Btrfs: remove BUG_ON()'s in btrfs_map_block
btrfs_map_block can go horribly wrong in the face of fs corruption, lets agree
to not be assholes and panic at any possible chance things are all fucked up.

Signed-off-by: Josef Bacik <jbacik@fb.com>
[ removed type casts ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Liu Bo
3d8da67817 Btrfs: fix divide error upon chunk's stripe_len
The struct 'map_lookup' uses type int for @stripe_len, while
btrfs_chunk_stripe_len() can return a u64 value, and it may end up with
@stripe_len being undefined value and it can lead to 'divide error' in
 __btrfs_map_block().

This changes 'map_lookup' to use type u64 for stripe_len, also right now
we only use BTRFS_STRIPE_LEN for stripe_len, so this adds a valid checker for
BTRFS_STRIPE_LEN.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ folded division fix to scrub_raid56_parity ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Anand Jain
779bf3fefa btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex
When the replace target fails, the target device will be taken
out of fs device list, scratch + update_dev_time and freed. However
we could do the scratch  + update_dev_time and free part after the
device has been taken out of device list, so that we don't have to
hold the device_list_mutex and uuid_mutex locks.

Reported issue:

[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861]  (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862]
[ 5375.718862] but task is already holding lock:
[ 5375.718907]  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907]
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907]
[ 5375.718908]
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911]
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940]        [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945]        [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948]        [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951]        [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955]        [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959]        [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962]        [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968]
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974]        [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977]        [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979]        [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982]        [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985]        [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991]
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001]        [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007]        [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010]        [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015]
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018]        [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026]        [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028]        [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031]        [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035]        [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040]        [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043]        [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073]        [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099]        [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123]        [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150]        [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175]        [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199]        [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222]        [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225]        [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229]        [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230]
[ 5375.719230] other info that might help us debug this:
[ 5375.719230]
[ 5375.719233] Chain exists of:
[ 5375.719233]   sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233]
[ 5375.719234]  Possible unsafe locking scenario:
[ 5375.719234]
[ 5375.719234]        CPU0                    CPU1
[ 5375.719235]        ----                    ----
[ 5375.719236]   lock(&fs_devs->device_list_mutex);
[ 5375.719238]                                lock(namespace_sem);
[ 5375.719239]                                lock(&fs_devs->device_list_mutex);
[ 5375.719241]   lock(sb_writers);
[ 5375.719241]
[ 5375.719241]  *** DEADLOCK ***
[ 5375.719241]
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266]  #0:  (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293]  #1:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319]  #2:  (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343]  #3:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343]
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352]  0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354]  ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357]  ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366]  [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369]  [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376]  [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383]  [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387]  [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389]  [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393]  [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420]  [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423]  [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426]  [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430]  [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433]  [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436]  [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462]  [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485]  [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506]  [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530]  [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554]  [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576]  [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598]  [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621]  [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641]  [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663]  [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669]  [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Austin S. Hemmelgarn
88be159c90 btrfs: allow balancing to dup with multi-device
Currently, we don't allow the user to try and rebalance to a dup profile
on a multi-device filesystem.  In most cases, this is a perfectly sensible
restriction as raid1 uses the same amount of space and provides better
protection.

However, when reshaping a multi-device filesystem down to a single device
filesystem, this requires the user to convert metadata and system chunks
to single profile before deleting devices, and then convert again to dup,
which leaves a period of time where metadata integrity is reduced.

This patch removes the single-device-only restriction from converting to
dup profile to remove this potential data integrity reduction.

Signed-off-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-06 15:22:49 +02:00
Anand Jain
88acff64c6 btrfs: cleanup assigning next active device with a check
Creates helper fucntion as needed by the device delete
and replace operations. Also now it checks if the next
device being assigned is an active device.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-04 10:41:08 +02:00
Anand Jain
8ed01abe7d btrfs: s_bdev is not null after missing replace
Yauhen reported in the ML that s_bdev is null at mount, and
s_bdev gets updated to some device when missing device is
replaced, as because bdev is null for missing device, things
gets matched up. Fix this by checking if s_bdev is set. I
didn't want to completely remove updating s_bdev because
the future multi device support at vfs layer may need it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-04 09:52:44 +02:00
David Sterba
5c5c0df05d btrfs: rename btrfs_find_device_by_user_input
For clarity how we are going to find the device, let's call it a device
specifier, devspec for short. Also rename the arguments that are a
leftover from previous function purpose.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
418775a22b btrfs: use existing device constraints table btrfs_raid_array
We should avoid duplicating the device constraints, let's use the
btrfs_raid_array in btrfs_check_raid_min_devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
621292bae6 btrfs: introduce raid-type to error-code table, for minimum device constraint
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
3cc31a0d5b btrfs: pass number of devices to btrfs_check_raid_min_devices
Before this patch, btrfs_check_raid_min_devices would do an off-by-one
check of the constraints and not the miminmum check, as its name
suggests. This is not a problem if the only caller is device remove, but
would be confusing for others.

Add an argument with the exact number and let the caller(s) decide if
this needs any adjustments, like when device replace is running.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
David Sterba
f47ab2588e btrfs: rename __check_raid_min_devices
Underscores are for special functions, use the full prefix for better
stacktrace recognition.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
02feae3c55 btrfs: optimize check for stale device
Optimize check for stale device to only be checked when there is device
added or changed. If there is no update to the device, there is no need
to call btrfs_free_stale_device().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
6b526ed70c btrfs: introduce device delete by devid
This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct
btrfs_ioctl_vol_args_v2 to carry devid as an user argument.

The patch won't delete the old ioctl interface and so kernel remains
backward compatible with user land progs.

Test case/script:
echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk
mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk
mount /dev/sdd /btrfs
dmsetup suspend bad_disk
echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk
dmsetup resume bad_disk
echo "bad disk failed. now deleting/replacing"
btrfs dev del  3  /btrfs
echo $?
btrfs fi show /btrfs
umount /btrfs
btrfs-show-super /dev/sdd | egrep num_device
dmsetup remove bad_disk
wipefs -a /dev/sdf

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Martin <m_btrfs@ml1.co.uk>
[ adjust messages, s/disk/device/ ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
42b6742715 btrfs: make use of btrfs_scratch_superblocks() in btrfs_rm_device()
With the previous patches now the btrfs_scratch_superblocks() is ready to
be used in btrfs_rm_device() so use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ use GFP_KERNEL ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
b3d1b1532f btrfs: enhance btrfs_find_device_by_user_input() to check device path
The operation of device replace and device delete follows same steps upto
some depth with in btrfs kernel, however they don't share codes. This
enhancement will help replace and delete to share codes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
24fc572fe4 btrfs: make use of btrfs_find_device_by_user_input()
btrfs_rm_device() has a section of the code which can be replaced
btrfs_find_device_by_user_input()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
24e0474b59 btrfs: create helper btrfs_find_device_by_user_input()
The patch renames btrfs_dev_replace_find_srcdev() to
btrfs_find_device_by_user_input() and moves it to volumes.c, so that
delete device can use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
bd45ffbcb1 btrfs: clean up and optimize __check_raid_min_device()
__check_raid_min_device() which was pealed from btrfs_rm_device()
maintianed its original code to show the block move. This patch cleans up
__check_raid_min_device().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
f1fa7f2642 btrfs: create helper function __check_raid_min_devices()
move a section of btrfs_rm_device() code to check for min number of the
devices into the function __check_raid_min_devices()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:13 +02:00
Anand Jain
6cf86a006b btrfs: create a helper function to read the disk super
A part of code from btrfs_scan_one_device() is moved to a new function
btrfs_read_disk_super(), so that former function looks cleaner. (In this
process it also moves the code which ensures null terminating label). So
this creates easy opportunity to merge various duplicate codes on read
disk super. Earlier attempt to merge duplicate codes highlighted that
there were some issues for which there are duplicate codes (to read disk
super), however it was not clear what was the issue. So until we figure
that out, its better to keep them in a separate functions.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ use GFP_KERNEL, PAGE_CACHE_ removal related fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:59:04 +02:00
Liu Bo
cf25ce518e Btrfs: do not create empty block group if we have allocated data
Now we force to create empty block group to keep data profile alive,
however, in the below example, we eventually get an empty block group
while we're trying to get more space for other types (metadata/system),

- Before,
block group "A": size=2G, used=1.2G
block group "B": size=2G, used=512M

- After "btrfs balance start -dusage=50 mount_point",
block group "A": size=2G, used=(1.2+0.5)G
block group "C": size=2G, used=0

Since there is no data in block group C, it won't be deleted
automatically and we have to get the unused 2G until the next mount.

Balance itself just moves data and doesn't remove data, so it's safe
to not create such a empty block group if we already have data
 allocated in other block groups.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:41:47 +02:00
Anand Jain
34d9700702 btrfs: rename btrfs_std_error to btrfs_handle_fs_error
btrfs_std_error() handles errors, puts FS into readonly mode
(as of now). So its good idea to rename it to btrfs_handle_fs_error().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ edit changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-04-28 10:36:54 +02:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Adam Buchbinder
bb7ab3b92e btrfs: Fix misspellings in comments.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-14 15:05:02 +01:00
David Sterba
675d276b32 Merge branch 'foreign/liubo/replace-lockup' into for-chris-4.6 2016-02-26 15:38:32 +01:00
David Sterba
67d605fec1 Merge branch 'dev/rename-keys' into for-chris-4.6 2016-02-26 15:38:29 +01:00
Liu Bo
73beece9ca Btrfs: fix lockdep deadlock warning due to dev_replace
Xfstests btrfs/011 complains about a deadlock warning,

[ 1226.649039] =========================================================
[ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
[ 1226.649039] 4.1.0+ #270 Not tainted
[ 1226.649039] ---------------------------------------------------------
[ 1226.652955] kswapd0/46 just changed the state of lock:
[ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1226.652955]
other info that might help us debug this:
[ 1226.652955] Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

[ 1226.652955]  Possible interrupt unsafe locking scenario:

[ 1226.652955]        CPU0                    CPU1
[ 1226.652955]        ----                    ----
[ 1226.652955]   lock(&fs_info->dev_replace.lock);
[ 1226.652955]                                local_irq_disable();
[ 1226.652955]                                lock(&delayed_node->mutex);
[ 1226.652955]                                lock(&found->groups_sem);
[ 1226.652955]   <Interrupt>
[ 1226.652955]     lock(&delayed_node->mutex);
[ 1226.652955]
 *** DEADLOCK ***

Commit 084b6e7c76 ("btrfs: Fix a lockdep warning when running xfstest.") tried
to fix a similar one that has the exactly same warning, but with that, we still
run to this.

The above lock chain comes from
btrfs_commit_transaction
  ->btrfs_run_delayed_items
    ...
    ->__btrfs_update_delayed_inode
      ...
      ->__btrfs_cow_block
         ...
         ->find_free_extent
            ->cache_block_group
              ->load_free_space_cache
                ->btrfs_readpages
                  ->submit_one_bio
                    ...
                    ->__btrfs_map_block
                      ->btrfs_dev_replace_lock

However, with high memory pressure, tasks which hold dev_replace.lock can
be interrupted by kswapd and then kswapd is intended to release memory occupied
by superblock, inodes and dentries, where we may call evict_inode, and it comes
to

[ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
[ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
[ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700

delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
to a ABBA deadlock.

To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
things are simpler here since we only needs read's spinlock to blocking lock.

With this, btrfs/011 no more produces warnings in dmesg.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-23 13:10:10 +01:00
David Sterba
242e2956e4 btrfs: switch dev stats item to the permanent item key
Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
c479cb4f14 btrfs: switch balance item to the temporary item key
No visible change.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 16:15:43 +01:00
David Sterba
78f2c9e6db btrfs: device add and remove: use GFP_KERNEL
We can safely use GFP_KERNEL in the functions called from the ioctl
handlers.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-11 15:19:39 +01:00
Zhao Lei
ad1ba2a0c4 btrfs: Use direct way to determine raid56 write/recover mode
Old code used bbio->raid_map to determine whether in raid56
write/recover operation, because we didn't't have bbio->map_type.

Now we have direct way for this condition, rid of using
the function-relative data, and make the code more readable.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:43:45 -08:00
Zhao Lei
94a97dfeb6 btrfs: Small cleanup for get index_srcdev loop
1: Adjust condition in loop to make less TAB
2: Move btrfs_put_bbio()'s line for combine, and makes logic clean.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:43:40 -08:00
Qu Wenruo
f04b772bfc btrfs: Enhance chunk validation check
Enhance chunk validation:
1) Num_stripes
   We already have such check but it's only in super block sys chunk
   array.
   Now check all on-disk chunks.

2) Chunk logical
   It should be aligned to sector size.
   This behavior should be *DOUBLE CHECKED* for 64K sector size like
   PPC64 or AArch64.
   Maybe we can found some hidden bugs.

3) Chunk length
   Same as chunk logical, should be aligned to sector size.

4) Stripe length
   It should be power of 2.

5) Chunk type
   Any bit out of TYPE_MAS | PROFILE_MASK is invalid.

With all these much restrict rules, several fuzzed image reported in
mail list should no longer cause kernel panic.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:21:41 -08:00
Filipe Manana
fedc00455c Btrfs: fix typo in log message when starting a balance
The recent change titled "Btrfs: Check metadata redundancy on balance"
(already in linux-next) left a typo in a message for users:
metatdata -> metadata.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19 18:21:40 -08:00
Chris Mason
326f784281 Merge branch 'misc-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2016-01-19 18:21:30 -08:00
Chris Mason
acc308556c Merge branch 'misc-cleanups-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2016-01-19 18:21:00 -08:00
Colin Ian King
fb75d857a3 btrfs: remove duplicate const specifier
duplicate const is redundant so remove it

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-19 10:33:56 +01:00
Sebastian Andrzej Siewior
546bed6312 btrfs: initialize the seq counter in struct btrfs_device
I managed to trigger this:
| INFO: trying to register non-static key.
| the code is fine but needs lockdep annotation.
| turning off the locking correctness validator.
| CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14
| Hardware name: ARM-Versatile Express
| [<80307cec>] (dump_stack)
| [<80070e98>] (__lock_acquire)
| [<8007184c>] (lock_acquire)
| [<80287800>] (btrfs_ioctl)
| [<8012a8d4>] (do_vfs_ioctl)
| [<8012ac14>] (SyS_ioctl)

so I think that btrfs_device_data_ordered_init() is not invoked behind
a macro somewhere.

Fixes: 7cc8e58d53 ("Btrfs: fix unprotected device's variants on 32bits machine")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15 19:28:43 +01:00
Jeff Mahoney
95617d6932 btrfs: cleanup, stop casting for extent_map->lookup everywhere
Overloading extent_map->bdev to struct map_lookup * might have started out
as a means to an end, but it's a pattern that's used all over the place
now. Let's get rid of the casting and just add a union instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15 19:22:28 +01:00
Chris Mason
988f1f576d Merge branch 'for-chris-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11 08:39:28 -08:00
Chris Mason
b28cf57246 Merge branch 'misc-cleanups-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11 06:08:37 -08:00
Chris Mason
a3058101c1 Merge branch 'misc-for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2016-01-11 05:59:32 -08:00
Filipe Manana
8cdc7c5b00 Btrfs: fix fitrim discarding device area reserved for boot loader's use
As of the 4.3 kernel release, the fitrim ioctl can now discard any region
of a disk that is not allocated to any chunk/block group, including the
first megabyte which is used for our primary superblock and by the boot
loader (grub for example).

Fix this by not allowing to trim/discard any region in the device starting
with an offset not greater than min(alloc_start_mount_option, 1Mb), just
as it was not possible before 4.3.

A reproducer test case for xfstests follows.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1	# failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      cd /
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1

  # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
  # reserved for a boot loader to use (GRUB for example) and btrfs should never
  # use them - neither for allocating metadata/data nor should trim/discard them.
  # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
  $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
  $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io

  # Now mount the filesystem and perform a fitrim against it.
  _scratch_mount
  _require_batched_discard $SCRATCH_MNT
  $FSTRIM_PROG $SCRATCH_MNT

  # Now unmount the filesystem and verify the content of the ranges was not
  # modified (no trim/discard happened on them).
  _scratch_unmount
  echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
  od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
  od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV

  status=0
  exit

Reported-by: Vincent Petry  <PVince81@yahoo.fr>
Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
Fixes: 499f377f49 (btrfs: iterate over unused chunk space in FITRIM)
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2016-01-07 21:16:03 +00:00
Sam Tygier
ee592d0771 Btrfs: Check metadata redundancy on balance
When converting a filesystem via balance check that metadata mode
is at least as redundant as the data mode. For example give warning
when:
-dconvert=raid1 -mconvert=single

Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk>
[ minor message reformatting ]
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:20:56 +01:00
David Sterba
e4058b54d1 btrfs: cleanup, use enum values for btrfs_path reada
Replace the integers by enums for better readability. The value 2 does
not have any meaning since a717531942
"Btrfs: do less aggressive btree readahead" (2009-01-22).

Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 15:01:15 +01:00
Byongho Lee
ee22184b53 Btrfs: use linux/sizes.h to represent constants
We use many constants to represent size and offset value.  And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'.  However we can make far more readable with 'SZ_256MB'
which is defined in the 'linux/sizes.h'.

So this patch replaces 'xxx * 1024 * 1024' kind of expression with
single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
not a power of 2. And I haven't touched to '4096' & '8192' because it's
more intuitive than 'SZ_4KB' & 'SZ_8KB'.

Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:38:02 +01:00
David Sterba
7928d672ff btrfs: cleanup, remove stray return statements
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:30:52 +01:00
David Sterba
93a3d46780 btrfs: verbose error when we find an unexpected item in sys_array
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
David Sterba
f5cdedd73f btrfs: handle invalid num_stripes in sys_array
We can handle the special case of num_stripes == 0 directly inside
btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
catch other unhandled cases where we fail to validate external data.

A crafted or corrupted image crashes at mount time:

BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
BTRFS info (device loop0): disk space caching is enabled
BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
Stack:
 637af890 60062489 602aeb2e 604192ba
 60387961 00000011 637af8a0 6038a835
 637af9c0 6038776b 634ef32b 00000000
Call Trace:
 [<6001c86d>] show_stack+0xfe/0x15b
 [<6038a835>] dump_stack+0x2a/0x2c
 [<6038776b>] panic+0x13e/0x2b3
 [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
 [<601cfbbe>] open_ctree+0x192d/0x27af
 [<6019c2c1>] btrfs_mount+0x8f5/0xb9a
 [<600bc9a7>] mount_fs+0x11/0xf3
 [<600d5167>] vfs_kern_mount+0x75/0x11a
 [<6019bcb0>] btrfs_mount+0x2e4/0xb9a
 [<600bc9a7>] mount_fs+0x11/0xf3
 [<600d5167>] vfs_kern_mount+0x75/0x11a
 [<600d710b>] do_mount+0xa35/0xbc9
 [<600d7557>] SyS_mount+0x95/0xc8
 [<6001e884>] handle_syscall+0x6b/0x8e

Reported-by: Jiri Slaby <jslaby@suse.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
CC: stable@vger.kernel.org	# 3.19+
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
Zhao Lei
c5ca87819d btrfs: Support convert to -d dup for btrfs-convert
Since we will add support for -d dup for non-mixed filesystem,
kernel need to support converting to this raid-type.

This patch remove limitation of above case.

Tested by following script:
(combination of dup conversion with fsck):

export TEST_DEV='/dev/vdc'
export TEST_DIR='/var/ltf/tester/mnt'

do_dup_test()
{
    local m_from="$1"
    local d_from="$2"
    local m_to="$3"
    local d_to="$4"

    echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"

    umount "$TEST_DIR" &>/dev/null
    ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
    mount "$TEST_DEV" "$TEST_DIR" || return 1

    cp -a /sbin/* "$TEST_DIR"

    [[ "$m_from" != "$m_to" ]] && {
        ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
    }

    [[ "$d_from" != "$d_to" ]] && {
	local opt=()
	[[ "$d_to" == single ]] && opt+=("-f")
        ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
    }

    umount "$TEST_DIR" || return 1
    ./btrfsck "$TEST_DEV" || return 1
    echo

    return 0
}

test_all()
{
    for m_from in single dup; do
    for d_from in single dup; do
    for m_to in single dup; do
    for d_to in single dup; do
    do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
    done
    done
    done
    done
}

test_all

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07 14:26:58 +01:00
Chris Mason
140e639f1a btrfs: fix warning on uninit variable in btrfs_finish_chunk_alloc
map->num_stripes really can't be zero, but just in case.

Signed-off-by: Chris Mason <clm@fb.com>
2015-12-23 13:30:51 -08:00
Chris Mason
a53fe25769 Merge branch 'for-chris-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.5 2015-12-23 13:28:35 -08:00
Chris Mason
afa427cf9d Merge branch 'cleanup/misc-simplify' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 2015-12-23 13:10:26 -08:00
Linus Torvalds
fc315e3e5c Merge branch 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "A couple of small fixes"

* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: check prepare_uptodate_page() error code earlier
  Btrfs: check for empty bitmap list in setup_cluster_bitmaps
  btrfs: fix misleading warning when space cache failed to load
  Btrfs: fix transaction handle leak in balance
  Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list
2015-12-18 15:35:08 -08:00
Filipe Manana
50460e3718 Btrfs: fix race when finishing dev replace leading to transaction abort
During the final phase of a device replace operation, I ran into a
transaction abort that resulted in the following trace:

[23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
[23919.664742] BTRFS: Transaction aborted (error -2)
[23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
[23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
[23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[23919.689151]  0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
[23919.692604]  ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
[23919.694230]  ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
[23919.696716] Call Trace:
[23919.698669]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
[23919.700597]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
[23919.701958]  [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.703612]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
[23919.705047]  [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
[23919.706967]  [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
[23919.708611]  [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
[23919.710099]  [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
[23919.711970]  [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
[23919.713602]  [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
[23919.714756]  [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
[23919.716155]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.718918]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
[23919.724170]  [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
[23919.725482]  [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
[23919.726790]  [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
[23919.728428]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
[23919.729642]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
[23919.730782]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
[23919.731847]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[23919.733330] ---[ end trace 166ef301a335832a ]---

This is due to a race between device replace and chunk allocation, which
the following diagram illustrates:

         CPU 1                                    CPU 2

 btrfs_dev_replace_finishing()

   at this point
    dev_replace->tgtdev->devid ==
    BTRFS_DEV_REPLACE_DEVID (0ULL)

   ...

   btrfs_start_transaction()
   btrfs_commit_transaction()

                                               btrfs_fallocate()
                                                 btrfs_alloc_data_chunk_ondemand()
                                                   btrfs_join_transaction()
                                                     --> starts a new transaction
                                                   do_chunk_alloc()
                                                     lock fs_info->chunk_mutex
                                                       btrfs_alloc_chunk()
                                                         --> creates extent map for
                                                             the new chunk with
                                                             em->bdev->map->stripes[i]->dev->devid
                                                             == X (X > 0)
                                                         --> extent map is added to
                                                             fs_info->mapping_tree
                                                         --> initial phase of bg A
                                                             allocation completes
                                                     unlock fs_info->chunk_mutex

   lock fs_info->chunk_mutex

   btrfs_dev_replace_update_device_in_mapping_tree()
     --> iterates fs_info->mapping_tree and
         replaces the device in every extent
         map's map->stripes[] with
         dev_replace->tgtdev, which still has
         an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)

                                                   btrfs_end_transaction()
                                                     btrfs_create_pending_block_groups()
                                                       --> starts final phase of
                                                           bg A creation (update device,
                                                           extent, and chunk trees, etc)
                                                       btrfs_finish_chunk_alloc()

                                                         btrfs_update_device()
                                                           --> attempts to update a device
                                                               item with ID == 0ULL
                                                               (BTRFS_DEV_REPLACE_DEVID)
                                                               which is the current ID of
                                                               bg A's
                                                               em->bdev->map->stripes[i]->dev->devid
                                                           --> doesn't find such item
                                                               returns -ENOENT
                                                           --> the device id should have been X
                                                               and not 0ULL

                                                       got -ENOENT from
                                                       btrfs_finish_chunk_alloc()
                                                       and aborts current transaction

   finishes setting up the target device,
   namely it sets tgtdev->devid to the value
   of srcdev->devid, which is X (and X > 0)

   frees the srcdev

   unlock fs_info->chunk_mutex

So fix this by taking the device list mutex when processing the chunk's
extent map stripes to update the device items. This avoids getting the
wrong device id and use-after-free problems if the task finishing a
chunk allocation grabs the replaced device, which is freed while the
dev replace task is holding the device list mutex.

This happened while running fstest btrfs/071.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-12-17 10:59:46 +00:00
Filipe Manana
8a7d656f3d Btrfs: fix transaction handle leak in balance
If we fail to allocate a new data chunk, we were jumping to the error path
without release the transaction handle we got before. Fix this by always
releasing it before doing the jump.

Fixes: 2c9fe83552 ("btrfs: Fix lost-data-profile caused by balance bg")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-12-10 11:23:24 +00:00
David Sterba
4db8c528cd btrfs: remove a trivial helper btrfs_set_buffer_uptodate
Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-07 15:06:45 +01:00
David Sterba
87ad58c5f0 btrfs: make btrfs_close_one_device static
Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03 15:02:21 +01:00
Linus Torvalds
80e0c505b2 Merge branch 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has Mark Fasheh's patches to fix quota accounting during subvol
  deletion, which we've been working on for a while now.  The patch is
  pretty small but it's a key fix.

  Otherwise it's a random assortment"

* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix balance range usage filters in 4.4-rc
  btrfs: qgroup: account shared subtree during snapshot delete
  Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
  btrfs: qgroup: fix quota disable during rescan
  Btrfs: fix race between cleaner kthread and space cache writeout
  Btrfs: fix scrub preventing unused block groups from being deleted
  Btrfs: fix race between scrub and block group deletion
  btrfs: fix rcu warning during device replace
  btrfs: Continue replace when set_block_ro failed
  btrfs: fix clashing number of the enhanced balance usage filter
  Btrfs: fix the number of transaction units needed to remove a block group
  Btrfs: use global reserve when deleting unused block group after ENOSPC
  Btrfs: tests: checking for NULL instead of IS_ERR()
  btrfs: fix signed overflows in btrfs_sync_file
2015-11-27 15:45:45 -08:00
Holger Hoffstätte
dba72cb30b btrfs: fix balance range usage filters in 4.4-rc
There's a regression in 4.4-rc since commit bc3094673f
(btrfs: extend balance filter usage to take minimum and maximum) in that
existing (non-ranged) balance with -dusage=x no longer works; all chunks
are skipped.

After staring at the code for a while and wondering why a non-ranged
balance would even need min and max thresholds (..which then were not
set correctly, leading to the bug) I realized that the only problem
was the fact that the filter functions were named wrong, thanks to
patching copypasta. Simply renaming both functions lets the existing
btrfs-progs call balance with -dusage=x and now the non-ranged filter
function is invoked, properly using only a single chunk limit.

Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Fixes: bc3094673f ("btrfs: extend balance filter usage to take minimum and maximum")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:27:33 -08:00
David Sterba
31388ab2ed btrfs: fix rcu warning during device replace
The test btrfs/011 triggers a rcu warning
Reviewed-by: Anand Jain <anand.jain@oracle.com>

===============================
[ INFO: suspicious RCU usage. ]
4.4.0-rc1-default+ #286 Tainted: G        W
-------------------------------
fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
4 locks held by btrfs/28786:

0:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs]
1:  (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs]
2:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs]
3:  (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs]

stack backtrace:
CPU: 0 PID: 28786 Comm: btrfs Tainted: G        W       4.4.0-rc1-default+ #286
Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010
0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001
0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78
ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220
Call Trace:
[<ffffffff8141d47b>] dump_stack+0x4f/0x74
[<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140
[<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs]
[<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80
[<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs]
[<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs]
[<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs]
[<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs]
[<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs]
[<ffffffff81196aa5>] ? __might_fault+0x95/0xa0
[<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs]
[<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
[<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs]
[<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs]
[<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
[<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0
[<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0
[<ffffffff811e3336>] ? __fget_light+0x86/0xb0
[<ffffffff811e3369>] ? __fdget+0x9/0x20
[<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80
[<ffffffff811d7483>] SyS_ioctl+0x53/0x80
[<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

This is because of unprotected use of rcu_dereference in
btrfs_scratch_superblocks. We can't add rcu locks around the whole
function because we read the superblock.

The fix will use the rcu string buffer directly without the rcu locking.
Thi is safe as the device will not go away in the meantime. We're
holding the device list mutexes.

Restructuring the code to narrow down the rcu section turned out to be
impossible, we need to call filp_open (through update_dev_time) on the
buffer and this could call kmalloc/__might_sleep. We could call kstrdup
with GFP_ATOMIC but it's not absolutely necessary.

Fixes: 12b1c2637b (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks)
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:51 -08:00
Filipe Manana
7fd01182d1 Btrfs: fix the number of transaction units needed to remove a block group
We were using only 1 transaction unit when attempting to delete an unused
block group but in reality we need 3 + N units, where N corresponds to the
number of stripes. We were accounting only for the addition of the orphan
item (for the block group's free space cache inode) but we were not
accounting that we need to delete one block group item from the extent
tree, one free space item from the tree of tree roots and N device extent
items from the device tree.

While one unit is not enough, it worked most of the time because for each
single unit we are too pessimistic and assume an entire tree path, with
the highest possible heigth (8), needs to be COWed with eventual node
splits at every possible level in the tree, so there was usually enough
reserved space for removing all the items and adding the orphan item.

However after adding the orphan item, writepages() can by called by the VM
subsystem against the btree inode when we are under memory pressure, which
causes writeback to start for the nodes we COWed before, this forces the
operation to remove the free space item to COW again some (or all of) the
same nodes (in the tree of tree roots). Even without writepages() being
called, we could fail with ENOSPC because these items are located in
multiple trees and one of them might have a higher heigth and require
node/leaf splits at many levels, exhausting all the reserved space before
removing all the items and adding the orphan.

In the kernel 4.0 release, commit 3d84be7991 ("Btrfs: fix BUG_ON in
btrfs_orphan_add() when delete unused block group"), we attempted to fix
a BUG_ON due to ENOSPC when trying to add the orphan item by making the
cleaner kthread reserve one transaction unit before attempting to remove
the block group, but this was not enough. We had a couple user reports
still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
a 4.2-rc6 kernel for example:

    http://www.spinics.net/lists/linux-btrfs/msg46070.html

So fix this by reserving all the necessary units of metadata.

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Fixes: 3d84be7991 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:50 -08:00
Filipe Manana
8eab77ff16 Btrfs: use global reserve when deleting unused block group after ENOSPC
It's possible to reach a state where the cleaner kthread isn't able to
start a transaction to delete an unused block group due to lack of enough
free metadata space and due to lack of unallocated device space to allocate
a new metadata block group as well. If this happens try to use space from
the global block group reserve just like we do for unlink operations, so
that we don't reach a permanent state where starting a transaction for
filesystem operations (file creation, renames, etc) keeps failing with
-ENOSPC. Such an unfortunate state was observed on a machine where over
a dozen unused data block groups existed and the cleaner kthread was
failing to delete them due to ENOSPC error when attempting to start a
transaction, and even running balance with a -dusage=0 filter failed with
ENOSPC as well. Also unmounting and mounting again the filesystem didn't
help. Allowing the cleaner kthread to use the global block reserve to
delete the unused data block groups fixed the problem.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25 05:19:50 -08:00
Linus Torvalds
e75cdf9898 Merge branch 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes and cleanups from Chris Mason:
 "Some of this got cherry-picked from a github repo this week, but I
  verified the patches.

  We have three small scrub cleanups and a collection of fixes"

* 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: Use fs_info directly in btrfs_delete_unused_bgs
  btrfs: Fix lost-data-profile caused by balance bg
  btrfs: Fix lost-data-profile caused by auto removing bg
  btrfs: Remove len argument from scrub_find_csum
  btrfs: Reduce unnecessary arguments in scrub_recheck_block
  btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum
  btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum
  btrfs: scrub: setup all fields for sblock_to_check
  btrfs: scrub: set error stats when tree block spanning stripes
  Btrfs: fix race when listing an inode's xattrs
  Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow
  Btrfs: fix race leading to incorrect item deletion when dropping extents
  Btrfs: fix sleeping inside atomic context in qgroup rescan worker
  Btrfs: fix race waiting for qgroup rescan worker
  btrfs: qgroup: exit the rescan worker during umount
  Btrfs: fix extent accounting for partial direct IO writes
2015-11-13 16:30:29 -08:00
Zhao Lei
2c9fe83552 btrfs: Fix lost-data-profile caused by balance bg
Reproduce:
 (In integration-4.3 branch)

 TEST_DEV=(/dev/vdg /dev/vdh)
 TEST_DIR=/mnt/tmp

 umount "$TEST_DEV" >/dev/null
 mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"

 mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
 btrfs balance start -dusage=0 $TEST_DIR
 btrfs filesystem usage $TEST_DIR

 dd if=/dev/zero of="$TEST_DIR"/file count=100
 btrfs filesystem usage $TEST_DIR

Result:
 We can see "no data chunk" in first "btrfs filesystem usage":
 # btrfs filesystem usage $TEST_DIR
 Overall:
    ...
 Metadata,single: Size:8.00MiB, Used:0.00B
    /dev/vdg        8.00MiB
 Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
    /dev/vdg      122.88MiB
    /dev/vdh      122.88MiB
 System,single: Size:4.00MiB, Used:0.00B
    /dev/vdg        4.00MiB
 System,RAID1: Size:8.00MiB, Used:16.00KiB
    /dev/vdg        8.00MiB
    /dev/vdh        8.00MiB
 Unallocated:
    /dev/vdg        1.06GiB
    /dev/vdh        1.07GiB

 And "data chunks changed from raid1 to single" in second
 "btrfs filesystem usage":
 # btrfs filesystem usage $TEST_DIR
 Overall:
    ...
 Data,single: Size:256.00MiB, Used:0.00B
    /dev/vdh      256.00MiB
 Metadata,single: Size:8.00MiB, Used:0.00B
    /dev/vdg        8.00MiB
 Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
    /dev/vdg      122.88MiB
    /dev/vdh      122.88MiB
 System,single: Size:4.00MiB, Used:0.00B
    /dev/vdg        4.00MiB
 System,RAID1: Size:8.00MiB, Used:16.00KiB
    /dev/vdg        8.00MiB
    /dev/vdh        8.00MiB
 Unallocated:
    /dev/vdg        1.06GiB
    /dev/vdh      841.92MiB

Reason:
 btrfs balance delete last data chunk in case of no data in
 the filesystem, then we can see "no data chunk" by "fi usage"
 command.

 And when we do write operation to fs, the only available data
 profile is 0x0, result is all new chunks are allocated single type.

Fix:
 Allocate a data chunk explicitly to ensure we don't lose the
 raid profile for data.

Test:
 Test by above script, and confirmed the logic by debug output.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-11-10 19:27:20 -08:00
Linus Torvalds
ad804a0b2a Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:

 - most of the rest of MM

 - procfs

 - lib/ updates

 - printk updates

 - bitops infrastructure tweaks

 - checkpatch updates

 - nilfs2 update

 - signals

 - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
   dma-debug, dma-mapping, ...

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  ipc,msg: drop dst nil validation in copy_msg
  include/linux/zutil.h: fix usage example of zlib_adler32()
  panic: release stale console lock to always get the logbuf printed out
  dma-debug: check nents in dma_sync_sg*
  dma-mapping: tidy up dma_parms default handling
  pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
  kexec: use file name as the output message prefix
  fs, seqfile: always allow oom killer
  seq_file: reuse string_escape_str()
  fs/seq_file: use seq_* helpers in seq_hex_dump()
  coredump: change zap_threads() and zap_process() to use for_each_thread()
  coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
  signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
  signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
  signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
  signals: kill block_all_signals() and unblock_all_signals()
  nilfs2: fix gcc uninitialized-variable warnings in powerpc build
  nilfs2: fix gcc unused-but-set-variable warnings
  MAINTAINERS: nilfs2: add header file for tracing
  nilfs2: add tracepoints for analyzing reading and writing metadata files
  ...
2015-11-07 14:32:45 -08:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
David Sterba
bc3094673f btrfs: extend balance filter usage to take minimum and maximum
Similar to the 'limit' filter, we can enhance the 'usage' filter to
accept a range. The change is backward compatible, the range is applied
only in connection with the BTRFS_BALANCE_ARGS_USAGE_RANGE flag.

We don't have a usecase yet, the current syntax has been sufficient. The
enhancement should provide parity with other range-like filters.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:38:30 -07:00
Gabríel Arthúr Pétursson
dee32d0ac3 btrfs: add balance filter for stripes
Balance block groups which have the given number of stripes, defined by
a range min..max. This is useful to selectively rebalance only chunks
that do not span enough devices, applies to RAID0/10/5/6.

Signed-off-by: Gabríel Arthúr Pétursson <gabriel@system.is>
[ renamed bargs members, added to the UAPI, wrote the changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>

Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:38:29 -07:00
David Sterba
12907fc798 btrfs: extend balance filter limit to take minimum and maximum
The 'limit' filter is underdesigned, it should have been a range for
[min,max], with some relaxed semantics when one of the bounds is
missing. Besides that, using a full u64 for a single value is a waste of
bytes.

Let's fix both by extending the use of the u64 bytes for the [min,max]
range. This can be done in a backward compatible way, the range will be
interpreted only if the appropriate flag is set
(BTRFS_BALANCE_ARGS_LIMIT_RANGE).

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-26 19:38:28 -07:00
Josef Bacik
3204d33cda Btrfs: add a flags field to btrfs_transaction
I want to set some per transaction flags, so instead of adding yet another int
lets just convert the current two int indicators to flags and add a flags field
for future use.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:51:45 -07:00
Chris Mason
a0d58e48db Merge branch 'cleanups/for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 2015-10-21 18:21:40 -07:00
Alexandru Moise
bdcd3c97d1 btrfs: cleanup btrfs_balance profile validity checks
Improve readability by generalizing the profile validity checks.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Zhao Lei
8789f4fe60 btrfs: use btrfs_raid_array for btrfs_get_num_tolerated_disk_barrier_failures()
btrfs_raid_array[] is used to define all raid attributes, use it
to get tolerated_failures in btrfs_get_num_tolerated_disk_barrier_failures(),
instead of complex condition in function.

It can make code simple and auto-support other possible raid-type in
future.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Zhao Lei
af90204750 btrfs: Move btrfs_raid_array to public
This array is used to record attributes of each raid type,
make it public, and many functions will benifit with this array.

For example, num_tolerated_disk_barrier_failures(), we can
avoid complex conditions in this function, and get raid attribute
simply by accessing above array.

It can also make code logic simple, reduce duplication code, and
increase maintainability.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-21 18:28:48 +02:00
Chris Mason
6db4a7335d Merge branch 'fix/waitqueue-barriers' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 2015-10-12 16:24:40 -07:00
Chris Mason
62fb50ab7c Merge branch 'anand/sysfs-updates-v4.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-12 16:24:15 -07:00
David Sterba
ee86395458 btrfs: comment the rest of implicit barriers before waitqueue_active
There are atomic operations that imply the barrier for waitqueue_active
mixed in an if-condition.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-10 18:42:00 +02:00
David Sterba
f14d104dbd btrfs: switch more printks to our helpers
Convert the simple cases, not all functions provide a way to reach the
fs_info. Also skipped debugging messages (print-tree, integrity
checker and pr_debug) and messages that are printed from possibly
unfinished mount.

Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 13:08:03 +02:00
David Sterba
b14af3b46f btrfs: switch message printers to ratelimited _in_rcu variants
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:55 +02:00
David Sterba
ecaeb14b91 btrfs: switch message printers to _in_rcu variants
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08 11:07:55 +02:00
Anand Jain
f190aa471a Btrfs: add helper for closing one device
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[reworded subject and changelog]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 18:00:05 +02:00
Anand Jain
097efc966a Btrfs: don't log error from btrfs_get_bdev_and_sb
Originally the message was not in a helper but ended up there. We should
print error messages from callers instead.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[reworded subject and changelog]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 17:56:47 +02:00
Anand Jain
12b1c2637b Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks
This patch updates and renames btrfs_scratch_superblocks, (which is used
by the replace device thread), with those fixes from the scratch
superblock code section of btrfs_rm_device(). The fixes are:
  Scratch all copies of superblock
  Notify kobject that superblock has been changed
  Update time on the device

So that btrfs_rm_device() can use the function
btrfs_scratch_superblocks() instead of its own scratch code. And further
replace deivce code which similarly releases device back to the system,
will have the fixes from the btrfs device delete.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed to btrfs_scratch_superblock]
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 17:37:34 +02:00
Anand Jain
d74a625987 Btrfs: use BTRFS_ERROR_DEV_MISSING_NOT_FOUND when missing device is not found
Use btrfs specific error code BTRFS_ERROR_DEV_MISSING_NOT_FOUND instead
of -ENOENT.  Next this removes the logging when user specifies "missing"
and we don't find it in the kernel device list. Logging are for system
events not for user input errors.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-01 16:47:16 +02:00
Anand Jain
a4553fefb5 Btrfs: consolidate btrfs_error() to btrfs_std_error()
btrfs_error() and btrfs_std_error() does the same thing
and calls _btrfs_std_error(), so consolidate them together.
And the main motivation is that btrfs_error() is closely
named with btrfs_err(), one handles error action the other
is to log the error, so don't closely name them.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:30:00 +02:00
Anand Jain
92fc03fbdc Btrfs: SB read failure should return EIO for __bread failure
This will return EIO when __bread() fails to read SB,
instead of EINVAL.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain
c1b7e47459 Btrfs: rename super_kobj to fsid_kobj
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain
3257604048 Btrfs: rename btrfs_kobj_rm_device to btrfs_sysfs_rm_device_link
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Anand Jain
e3bd6973bc Btrfs: rename btrfs_kobj_add_device to btrfs_sysfs_add_device_link
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29 16:29:59 +02:00
Linus Torvalds
e91eb6204f Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs cleanups and fixes from Chris Mason:
 "These are small cleanups, and also some fixes for our async worker
  thread initialization.

  I was having some trouble testing these, but it ended up being a
  combination of changing around my test servers and a shiny new
  schedule while atomic from the new start/finish_plug in
  writeback_sb_inodes().

  That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't
  changing a bunch of things in my test setup at once it would have been
  really clear.  Fix for writeback_sb_inodes() on the way as well"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
  btrfs: async_thread: Fix workqueue 'max_active' value when initializing
  btrfs: Add raid56 support for updating  num_tolerated_disk_barrier_failures in btrfs_balance
  btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
  btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
  btrfs: Update out-of-date "skip parity stripe" comment
2015-09-11 12:38:25 -07:00
Linus Torvalds
22365979ab Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This has Jeff Mahoney's long standing trim patch that fixes corners
  where trims were missing.  Omar has some raid5/6 fixes, especially for
  using scrub and device replace when devices are missing.

  Zhao Lie continues cleaning and fixing things, this series fixes some
  really hard to hit corners in xfstests.  I had to pull it last merge
  window due to some deadlocks, but those are now resolved.

  I added support for Tejun's new blkio controllers.  It seems to work
  well for single devices, we'll expand to multi-device as well"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits)
  btrfs: fix compile when block cgroups are not enabled
  Btrfs: fix file read corruption after extent cloning and fsync
  Btrfs: check if previous transaction aborted to avoid fs corruption
  btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
  btrfs: Prevent from early transaction abort
  btrfs: Remove unused arguments in tree-log.c
  btrfs: Remove useless condition in start_log_trans()
  Btrfs: add support for blkio controllers
  Btrfs: remove unused mutex from struct 'btrfs_fs_info'
  Btrfs: fix parity scrub of RAID 5/6 with missing device
  Btrfs: fix device replace of a missing RAID 5/6 device
  Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation
  Btrfs: count devices correctly in readahead during RAID 5/6 replace
  Btrfs: remove misleading handling of missing device scrub
  btrfs: fix clone / extent-same deadlocks
  Btrfs: fix defrag to merge tail file extent
  Btrfs: fix warning in backref walking
  btrfs: Add WARN_ON() for double lock in btrfs_tree_lock()
  btrfs: Remove root argument in extent_data_ref_count()
  btrfs: Fix wrong comment of btrfs_alloc_tree_block()
  ...
2015-09-05 15:14:43 -07:00
Zhao Lei
943c6e9925 btrfs: Add raid56 support for updating
num_tolerated_disk_barrier_failures in btrfs_balance

Code for updating fs_info->num_tolerated_disk_barrier_failures in
btrfs_balance() lacks raid56 support.

Reason:
 Above code was wroten in 2012-08-01, together with
 btrfs_calc_num_tolerated_disk_barrier_failures()'s first version.

 Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated
 later to support raid56, but code in btrfs_balance() was not
 updated together.

Fix:
 Merge above similar code to a common function:
 btrfs_get_num_tolerated_disk_barrier_failures()
 and make it support both case.

 It can fix this bug with a bonus of cleanup, and make these code
 never in above no-sync state from now on.

Suggested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:48 -07:00
Chris Mason
3a9508b022 btrfs: fix compile when block cgroups are not enabled
bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
This adds an ifdef around them.  It's not perfect, but our
use of bi_ioc is being removed in the 4.3 merge window.

The bi_css usage really should go into bio_clone, but I want to make
sure that doesn't introduce problems for other bio_clone use cases.

Signed-off-by: Chris Mason <clm@fb.com>
2015-08-21 10:08:13 -07:00
Michal Hocko
277fb5fc17 btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
transaction but this allocation context is rather weak wrt. reclaim
capabilities. The page allocator currently tries hard to not fail these
allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
still fail if the _current_ process is the OOM killer victim. Moreover
there is an attempt to move away from the default no-fail behavior and
allow these allocation to fail more eagerly. This would lead to:

[   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

which is clearly undesirable and the nofail behavior should be explicit
if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19 14:25:15 -07:00
Kent Overstreet
0e28997ec4 btrfs: remove bio splitting and merge_bvec_fn() calls
Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn() etc. That is not
necessary any more, because generic_make_request() is now able to
handle arbitrarily sized bios. So clean up unnecessary code paths.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13 12:31:43 -06:00
Chris Mason
46cd28555f Merge branch 'jeffm-discard-4.3' into for-linus-4.3 2015-08-09 07:35:33 -07:00
Chris Mason
da2f0f74cf Btrfs: add support for blkio controllers
This attaches accounting information to bios as we submit them so the
new blkio controllers can throttle on btrfs filesystems.

Not much is required, we're just associating bios with blkcgs during clone,
calling wbc_init_bio()/wbc_account_io() during writepages submission,
and attaching the bios to the current context during direct IO.

Finally if we are splitting bios during btrfs_map_bio, this attaches
accounting information to the split.

The end result is able to throttle nicely on single disk filesystems.  A
little more work is required for multi-device filesystems.

Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:35:06 -07:00
Zhaolei
dc2ee4e244 btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk()
Remove chunk_objectid argument from btrfs_relocate_chunk() because
it is not necessary, it can also cleanup some code in caller for
prepare its value.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:13 -07:00
Zhaolei
55e3a601c8 btrfs: Fix data checksum error cause by replace with io-load.
xfstests btrfs/070 sometimes failed.
In my test machine, its fail rate is about 30%.
In another vm(vmware), its fail rate is about 50%.

Reason:
  btrfs/070 do replace and defrag with fsstress simultaneously,
  after above operation, checksum error is found by scrub.

  Actually, it have no relationship with defrag operation, only
  replace with fsstress can trigger this bug.

  New data writen to target device have possibility rewrited by
  old data from source device by replace code in debug, to avoid
  above problem, we can set target block group to readonly in
  replace period, so new data requested by other operation will
  not write to same place with replace code.

  Before patch(4.1-rc3):
    30% failed in 100 xfstests.
  After patch:
    0% failed in 300 xfstests.

It also happened in btrfs/071 as it's another scrub with IO load tests.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09 07:07:13 -07:00
Jeff Mahoney
499f377f49 btrfs: iterate over unused chunk space in FITRIM
Since we now clean up block groups automatically as they become
empty, iterating over block groups is no longer sufficient to discard
unused space.

This patch iterates over the unused chunk space and discards any regions
that are unallocated, regardless of whether they were ever used.  This is
a change for btrfs but is consistent with other file systems.

We do this in a transactionless manner since the discard process can take
a substantial amount of time and a transaction would need to be started
before the acquisition of the device list lock.  That would mean a
transaction would be held open across /all/ of the discards collectively.
In order to prevent other threads from allocating or freeing chunks, we
hold the chunks lock across the search and discard calls.  We release it
between searches to allow the file system to perform more-or-less
normally.  Since the running transaction can commit and disappear while
we're using the transaction pointer, we take a reference to it and
release it after the search.  This is safe since it would happen normally
at the end of the transaction commit after any locks are released anyway.
We also take the commit_root_sem to protect against a transaction starting
and committing while we're running.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29 08:15:26 -07:00
Christoph Hellwig
4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Linus Torvalds
31b7a57c9e Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assortment of fixes.  Most of the commits are from Filipe
  (fsync, the inode allocation cache and a few others).  Mark kicked in
  a series fixing corners in the extent sharing ioctls, and everyone
  else fixed up on assorted other problems"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong check for btrfs_force_chunk_alloc()
  Btrfs: fix warning of bytes_may_use
  Btrfs: fix hang when failing to submit bio of directIO
  Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()
  Btrfs: fix memory corruption on failure to submit bio for direct IO
  btrfs: don't update mtime/ctime on deduped inodes
  btrfs: allow dedupe of same inode
  btrfs: fix deadlock with extent-same and readpage
  btrfs: pass unaligned length to btrfs_cmp_data()
  Btrfs: fix fsync after truncate when no_holes feature is enabled
  Btrfs: fix fsync xattr loss in the fast fsync path
  Btrfs: fix fsync data loss after append write
  Btrfs: fix crash on close_ctree() if cleaner starts new transaction
  Btrfs: fix race between caching kthread and returning inode to inode cache
  Btrfs: use kmem_cache_free when freeing entry in inode cache
  Btrfs: fix race between balance and unused block group deletion
  btrfs: add error handling for scrub_workers_get()
  btrfs: cleanup noused initialization of dev in btrfs_end_bio()
  btrfs: qgroup: allow user to clear the limitation on qgroup
2015-07-11 10:26:34 -07:00
Linus Torvalds
043cd04950 Merge branch 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "Outside of our usual batch of fixes, this integrates the subvolume
  quota updates that Qu Wenruo from Fujitsu has been working on for a
  few releases now.  He gets an extra gold star for making btrfs smaller
  this time, and fixing a number of quota corners in the process.

  Dave Sterba tested and integrated Anand Jain's sysfs improvements.
  Outside of exporting a symbol (ack'd by Greg) these are all internal
  to btrfs and it's mostly cleanups and fixes.  Anand also attached some
  of our sysfs objects to our internal device management structs instead
  of an object off the super block.  It will make device management
  easier overall and it's a better fit for how the sysfs files are used.
  None of the existing sysfs files are moved around.

  Thanks for all the fixes everyone"

* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (87 commits)
  btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref()
  Btrfs: Check if kobject is initialized before put
  lib: export symbol kobject_move()
  Btrfs: sysfs: add support to show replacing target in the sysfs
  Btrfs: free the stale device
  Btrfs: use received_uuid of parent during send
  Btrfs: fix use-after-free in btrfs_replay_log
  btrfs: wait for delayed iputs on no space
  btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup.
  btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.
  btrfs: ulist: Add ulist_del() function.
  btrfs: qgroup: Cleanup the old ref_node-oriented mechanism.
  btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.
  btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.
  btrfs: qgroup: Switch rescan to new mechanism.
  btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents().
  btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots().
  btrfs: qgroup: Add new function to record old_roots.
  btrfs: qgroup: Record possible quota-related extent for qgroup.
  btrfs: qgroup: Add function qgroup_update_counters().
  ...
2015-06-30 20:07:45 -07:00
Filipe Manana
67c5e7d464 Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:

[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>]  [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8  EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS:  00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566]  ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566]  ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566]  ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566]  [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566]  [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566]  [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566]  [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566]  [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566]  [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566]  [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566]  [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566]  [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566]  [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566]  [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566]  [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)

The sequence of steps leading to this are:

           CPU 0                                         CPU 1

  btrfs_balance()
    btrfs_relocate_chunk()

      btrfs_relocate_block_group(bg X)
        btrfs_lookup_block_group(bg X)

                                               cleaner_kthread
                                                  locks fs_info->cleaner_mutex

                                                  btrfs_delete_unused_bgs()
                                                    finds bg X, which became
                                                    unused in the previous
                                                    transaction

                                                    checks bg X ->ro == 0,
                                                    so it proceeds
        sets bg X ->ro to 1
        (btrfs_set_block_group_ro(bg X))

        blocks on fs_info->cleaner_mutex
                                                    btrfs_remove_chunk(bg X)
                                                  unlocks fs_info->cleaner_mutex

        acquires fs_info->cleaner_mutex
        relocate_block_group()
          --> does nothing, no extents found in
              the extent tree from bg X
        unlocks fs_info->cleaner_mutex

      btrfs_relocate_block_group(bg X) returns

    btrfs_remove_chunk(bg X)
       extent map not found
          --> ASSERT(0)

Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.

This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:

    while true; do btrfs balance start -dusage=0 <mountpoint> ; done

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:46 -07:00
Zhao Lei
65f5333875 btrfs: cleanup noused initialization of dev in btrfs_end_bio()
It is introduced by:
 c404e0dc2c
 Btrfs: fix use-after-free in the finishing procedure of the device replace

But seems no relationship with that bug, this patch revirt these
code block for cleanup.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 13:20:02 -07:00
Linus Torvalds
bfffa1cc9d Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block
Pull core block IO update from Jens Axboe:
 "Nothing really major in here, mostly a collection of smaller
  optimizations and cleanups, mixed with various fixes.  In more detail,
  this contains:

   - Addition of policy specific data to blkcg for block cgroups.  From
     Arianna Avanzini.

   - Various cleanups around command types from Christoph.

   - Cleanup of the suspend block I/O path from Christoph.

   - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

   - Eliminating atomic inc/dec of both remaining IO count and reference
     count in a bio.  From me.

   - Fixes for SG gap and chunk size support for data-less (discards)
     IO, so we can merge these better.  From me.

   - Small restructuring of blk-mq shared tag support, freeing drivers
     from iterating hardware queues.  From Keith Busch.

   - A few cfq-iosched tweaks, from Tahsin Erdogan and me.  Makes the
     IOPS mode the default for non-rotational storage"

* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
  cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
  cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
  cfq-iosched: move group scheduling functions under ifdef
  cfq-iosched: fix the setting of IOPS mode on SSDs
  blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
  block, cgroup: implement policy-specific per-blkcg data
  block: Make CFQ default to IOPS mode on SSDs
  block: add blk_set_queue_dying() to blkdev.h
  blk-mq: Shared tag enhancements
  block: don't honor chunk sizes for data-less IO
  block: only honor SG gap prevention for merges that contain data
  block: fix returnvar.cocci warnings
  block, dm: don't copy bios for request clones
  block: remove management of bi_remaining when restoring original bi_end_io
  block: replace trylock with mutex_lock in blkdev_reread_part()
  block: export blkdev_reread_part() and __blkdev_reread_part()
  suspend: simplify block I/O handling
  block: collapse bio bit space
  block: remove unused BIO_RW_BLOCK and BIO_EOF flags
  block: remove BIO_EOPNOTSUPP
  ...
2015-06-25 14:29:53 -07:00